Run Your Own OpenAI-Compatible API with LM Studio
A practical guide to downloading GGUF models, loading them locally, and exposing an HTTP endpoint your code can actually talk to.
What You're Actually Building
By the end of this guide, you'll have:
- A locally running LLM loaded in LM Studio
- An HTTP server at http://localhost:1234 that speaks the OpenAI API dialect
- A verified endpoint you can hit with curl, the openai Python SDK, or any tool that accepts a base_url
No cloud. No API key costs. No data leaving your machine.
Prerequisites
| Requirement | Why |
|---|---|
| LM Studio installed (v0.3.x or later) | Tested against current API surface |
| 8 GB RAM minimum (16 GB recommended) | Needed to load a 7B Q4 model comfortably |
| ~5–10 GB free disk space | For the model file |
| Python 3.8+ (optional) | For the verification step at the end |
Download LM Studio from lmstudio.ai. It's available for macOS, Windows, and Linux.
First-run requirement: Open the LM Studio GUI at least once before using the CLI (lms). This initializes the local config.
Step 1 — Download a GGUF Model
You have two paths: GUI or CLI. Both work. Pick one.
Path A: In-App Search (Recommended for First-Timers)
- Open LM Studio.
- Press Ctrl + Shift + M (Windows/Linux) or ⌘ + Shift + M (Mac) to open the model search.
- Type a model name, for example qwen2.5-7b-instruct.
- LM Studio will show available quantizations and highlight the recommended one for your hardware (Q4_K_M on most machines).
- Click Download.
You can also paste a full Hugging Face URL directly into the search bar. Example:
https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-GGUF
Path B: CLI Download
# Download by Hugging Face repo name
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF
# Specify a quantization with @
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M
What's a Quantization Level?
GGUF files come in variants like Q4_K_M, Q5_K_S, Q8_0. The number is the approximate bits per weight: a lower number means a smaller file and less memory, at some cost in output quality. Rule of thumb:
| Quant | RAM footprint (7B model) | Use when |
|---|---|---|
| Q4_K_M | ~4.5 GB | Standard choice — best quality/size tradeoff |
| Q5_K_M | ~5.5 GB | Slightly better quality, fits if you have headroom |
| Q8_0 | ~8 GB | Near-lossless, needs more VRAM/RAM |
Don't overthink this. Start with Q4_K_M.
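To see where figures like those in the table come from, here is a rough back-of-the-envelope sketch. The bits-per-weight values and the 7.6-billion parameter count are approximations assumed for illustration, not numbers reported by LM Studio, and real memory use will be higher once the context is allocated.

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These figures are assumptions for illustration, not values from LM Studio.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
}


def estimate_file_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model file size in GB (10^9 bytes) from the parameter count."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


# Assumes roughly 7.6 billion parameters for a "7B" model.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_file_size_gb(7.6, quant):.1f} GB")
```

Under these assumptions the estimates land close to the table above, which is why the footprint scales almost linearly with the quantization level.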
Manual Import (If You Already Have a .gguf File)
LM Studio expects a specific directory structure. Place your file here:
~/.lmstudio/models/
└── publisher-name/
    └── model-name/
        └── model-file.gguf
Example:
~/.lmstudio/models/
└── lmstudio-community/
    └── Qwen2.5-7B-Instruct-GGUF/
        └── Qwen2.5-7B-Instruct-Q4_K_M.gguf
Or use the CLI import command:
lms import /path/to/your/model-file.gguf
After placing files in the correct structure, the model will appear under My Models in the LM Studio UI.
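If you would rather script the manual placement, the layout can be reproduced with a few lines of Python. This is a minimal sketch: the function name `install_gguf` is hypothetical, and the default root simply mirrors the `~/.lmstudio/models` path shown above, so adjust it if your install uses a different models folder.

```python
import shutil
from pathlib import Path


def install_gguf(source: Path, publisher: str, model: str,
                 models_root: Path = Path.home() / ".lmstudio" / "models") -> Path:
    """Copy a .gguf file into models_root/publisher/model/ and return the new path."""
    if source.suffix != ".gguf":
        raise ValueError(f"expected a .gguf file, got {source.name}")
    target_dir = models_root / publisher / model
    # Create publisher/model folders if they do not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / source.name
    shutil.copy2(source, destination)
    return destination
```

A call such as `install_gguf(Path("/path/to/your/model-file.gguf"), "publisher-name", "model-name")` should leave the file where the tree above expects it.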
Step 2 — Load the Model
Before the server can serve a model, the model must be loaded into memory.
Via the UI
- Press Ctrl + L (or ⌘ + L) to open the model loader.
- Select your downloaded model from the list.
- LM Studio will auto-select load parameters optimized for your hardware (GPU offload, context size, etc.).
- Wait for the progress bar to complete.
Via CLI
# List your downloaded models
lms ls
# Load a model by its identifier (use the key shown in lms ls)
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF
GPU offloading: If you have an NVIDIA or Apple Silicon GPU, LM Studio will offload layers to it automatically. In the UI sidebar, you can also drag the GPU Offload slider to max to force full GPU inference — this dramatically speeds up generation.
Step 3 — Start the HTTP Server
This is the key step that turns LM Studio from a chat app into a backend.
Via the UI
- Go to the Developer tab (the </> icon in the left sidebar).
- Toggle "Start Server" to ON.
- You'll see:
Server running at http://localhost:1234
Via CLI
lms server start
To confirm it's running:
lms server status
The server listens on port 1234 by default. You can change this in the Developer tab settings.
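From a script, a quick way to tell whether anything is listening on that port is a plain TCP connection attempt. The sketch below uses only the standard library; the helper name `is_port_open` is hypothetical, and a successful connection only shows that some process is listening, not that it is LM Studio.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 1234, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False
```

Calling `is_port_open()` before sending requests lets a script fail early with a clear message instead of a stack trace.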
Step 4 — Verify the Endpoint
With curl
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Reply with: working."}
    ],
    "temperature": 0.1
  }'
Expected response shape:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "working."
    },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14 }
}
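The same check can be done without curl or any third-party package. The following is a sketch using only the Python standard library; the helper names (`build_chat_request`, `extract_reply`, `ask`) are hypothetical, and the parsing assumes the response has the shape shown above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt to the server and return the reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `ask("http://localhost:1234/v1", "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "Reply with: working.")` should return the model's reply as a string.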
Check Which Models Are Loaded
curl http://localhost:1234/v1/models
This returns a JSON list of currently loaded models. The id field in each entry is what you pass as "model" in your API calls.
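To pick up the model id programmatically instead of copying it by hand, something like the sketch below may help. It assumes the response follows the OpenAI list shape, with the entries under a top-level `data` array; the function names are hypothetical.

```python
import json
import urllib.request
from typing import List


def extract_model_ids(payload: dict) -> List[str]:
    """Return the id of every entry in a /v1/models style response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1", timeout: float = 5.0) -> List[str]:
    """Ask the running server for its model list and return the ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

An empty list from `fetch_model_ids()` would suggest the server is up but has nothing to serve yet.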
Step 5 — Use It Like the OpenAI API
The endpoint is a drop-in replacement. You only need to change two things in any existing OpenAI client code:
- base_url → http://localhost:1234/v1
- api_key → any string (LM Studio doesn't validate it; "lm-studio" is the conventional placeholder)
Python Example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

response = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 17 multiplied by 4?"}
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
Install the OpenAI SDK if you haven't:
pip install openai
Streaming Example
# Reuses the client created in the Python example above
stream = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
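If you need the complete text after streaming finishes, for logging or post-processing, the fragments can be collected as they arrive. This is a small sketch; `collect_stream` is a hypothetical helper, and it assumes each chunk exposes `choices[0].delta.content` the way the loop above does.

```python
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas from a chat completion stream."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Skip chunks that carry no choices at all.
            continue
        text = chunk.choices[0].delta.content
        if text:
            # Chunks without text (for example the closing one) are ignored.
            parts.append(text)
    return "".join(parts)
```

Passing the `stream` object from the example above to `collect_stream` returns the whole reply as one string once the model stops generating.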
What Endpoints Are Available
| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | Chat inference (OpenAI-compatible) |
| GET /v1/models | List loaded models |
| POST /v1/completions | Legacy text completion |
| POST /v1/embeddings | Embedding vectors |
| POST /v1/responses | OpenAI Responses API (stateful) |
| POST /api/v1/chat | LM Studio native v1 API (richer stats) |
The /api/v1/* endpoints are LM Studio's native API (released in v0.4.0, so they need a newer build than the v0.3.x minimum listed in Prerequisites) and include enhanced stats like tokens/second and time-to-first-token. The /v1/* endpoints are the OpenAI-compatible layer; use these for maximum compatibility with existing tools.
Connecting to Other Tools
Since the endpoint is OpenAI-compatible, you can drop it into:
- LangChain: set openai_api_base="http://localhost:1234/v1"
- Open WebUI: add LM Studio as an OpenAI-compatible provider with the localhost URL
- Cursor / Continue.dev: point the model provider's base URL at http://localhost:1234/v1
- Any app with a "custom OpenAI base URL" field: enter the same URL
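Some tools have no settings field at all and rely on environment variables. Recent versions of the official OpenAI Python SDK read `OPENAI_BASE_URL` and `OPENAI_API_KEY` when no explicit values are passed, so a sketch like the following can redirect such a tool to the local server. The helper name `local_openai_env` is hypothetical, and whether a given tool honors these variables is an assumption you should verify.

```python
import os
from typing import Dict, Mapping


def local_openai_env(base: Mapping[str, str] = os.environ,
                     host: str = "localhost", port: int = 1234) -> Dict[str, str]:
    """Return a copy of the environment that points OpenAI clients at LM Studio."""
    env = dict(base)
    env["OPENAI_BASE_URL"] = f"http://{host}:{port}/v1"
    # LM Studio does not validate the key, so a placeholder is enough.
    env["OPENAI_API_KEY"] = "lm-studio"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` so that only the child process is redirected and your shell environment stays untouched.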
Common Issues and Fixes
Model not appearing in /v1/models
The server is running, but no model is loaded. Load a model first (Step 2), then restart the server if needed.
"Connection refused" on port 1234
The server isn't started. Go to the Developer tab and toggle it on, or run lms server start.
Slow inference
GPU offload may not be active. In the model loader sidebar, slide GPU Offload to maximum. Requires an NVIDIA GPU with CUDA or Apple Silicon.
Model identifier mismatch
Use curl http://localhost:1234/v1/models to get the exact model id string, then use that verbatim in your API calls.
Debugging chat template issues
lms log stream
This streams raw prompts sent to the model — useful for verifying that your system prompt and message format are being applied correctly.
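The first few checks in this section can be folded into a single triage helper. The sketch below is only an illustration of that decision order: `diagnose` is a hypothetical function, and its inputs are expected to come from your own checks (a port probe and the id list from /v1/models).

```python
from typing import List, Optional


def diagnose(port_open: bool, loaded_ids: List[str],
             requested_model: Optional[str] = None) -> str:
    """Map basic server observations to the most likely fix."""
    if not port_open:
        return "Server not reachable: toggle it on in the Developer tab or run `lms server start`."
    if not loaded_ids:
        return "Server is up but no model is loaded: load one first (Step 2)."
    if requested_model is not None and requested_model not in loaded_ids:
        return f"Unknown model id {requested_model!r}; use one of: {', '.join(loaded_ids)}"
    return "OK"
```

Checking in this order matters: a refused connection makes the model list meaningless, and an empty model list makes an id mismatch impossible to judge.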
Quick Reference
# Download a model
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M
# Load it
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF
# Start the server
lms server start
# Verify
curl http://localhost:1234/v1/models
# Test inference
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "messages": [{"role": "user", "content": "ping"}]}'
That's the full loop: download → load → serve → call.



