Run Your Own OpenAI-Compatible API with LM Studio

April 28, 2026 · 7 min read

Software Engineer

A practical guide to downloading GGUF models, loading them locally, and exposing an HTTP endpoint your code can actually talk to.

What You're Actually Building

By the end of this guide, you'll have:

A locally running LLM loaded in LM Studio
An HTTP server at http://localhost:1234 that speaks the OpenAI API dialect
A verified endpoint you can hit with curl, the openai Python SDK, or any tool that accepts a base_url

No cloud. No API key costs. No data leaving your machine.

Prerequisites

Requirement	Why
LM Studio installed (v0.3.x or later)	Tested against current API surface
8 GB RAM minimum (16 GB recommended)	Needed to load a 7B Q4 model comfortably
~5–10 GB free disk space	For the model file
Python 3.8+ (optional)	For the verification step at the end

Download LM Studio from lmstudio.ai. It's available for macOS, Windows, and Linux.

First-run requirement: Open the LM Studio GUI at least once before using the CLI (lms). This initializes the local config.

Step 1 — Download a GGUF Model

You have two paths: GUI or CLI. Both work. Pick one.

Path A: In-App Search (Recommended for First-Timers)

Open LM Studio.
Press Ctrl + Shift + M (Windows/Linux) or ⌘ + Shift + M (Mac) to open the model search.
Type a model name — for example, qwen2.5-7b-instruct.
LM Studio will show the available quantizations and highlight the one recommended for your hardware (Q4_K_M on most machines).
Click Download.

You can also paste a full Hugging Face URL directly into the search bar. Example: https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-GGUF

Path B: CLI Download

# Download by Hugging Face repo name
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Specify a quantization with @
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

What's a Quantization Level?

GGUF files come in variants like Q4_K_M, Q5_K_S, and Q8_0. The number is the approximate bits per weight: lower means a smaller file and less memory, at some cost in output quality. Rule of thumb:

Quant	RAM footprint (7B model)	Use when
Q4_K_M	~4.5 GB	Standard choice — best quality/size tradeoff
Q5_K_M	~5.5 GB	Slightly better quality, fits if you have headroom
Q8_0	~8 GB	Near-lossless, needs more VRAM/RAM

Don't overthink this. Start with Q4_K_M.
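To see roughly where the footprint figures above come from, you can multiply the parameter count by the bits per weight. The sketch below is a back-of-the-envelope calculation, not something LM Studio exposes. The bits-per-weight values are approximate assumptions (these formats store scaling data alongside the weights, so they land a little above their nominal bit count), and the 7.6 billion parameter count is an assumed figure for a "7B" model.

```python
# Approximate effective bits per weight. These are assumed values for
# illustration, not exact figures for any specific file.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_size_gb(params_billions: float, quant: str) -> float:
    """Estimate the weight size in GB (10^9 bytes) for a quantization level."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # params * bits / 8 gives bytes; params are in billions, so the result is GB.
    return params_billions * bits / 8


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_size_gb(7.6, quant):.1f} GB")
```

Treat the result as a lower bound: the context window (KV cache) needs memory on top of the weights.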

Manual Import (If You Already Have a .gguf File)

LM Studio expects a specific directory structure. Place your file here:

~/.lmstudio/models/
└── publisher-name/
    └── model-name/
        └── model-file.gguf

Example:

~/.lmstudio/models/
└── lmstudio-community/
    └── Qwen2.5-7B-Instruct-GGUF/
        └── Qwen2.5-7B-Instruct-Q4_K_M.gguf

Or use the CLI import command:

lms import /path/to/your/model-file.gguf

After placing files in the correct structure, the model will appear under My Models in the LM Studio UI.
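If you import files often, the publisher/model/file layout above is easy to script. This is a minimal sketch using only the standard library; place_model is a hypothetical helper (not part of LM Studio), and the demo runs against a temporary directory so it never touches your real models folder.

```python
import shutil
import tempfile
from pathlib import Path


def place_model(gguf_file: Path, models_root: Path, publisher: str, model_name: str) -> Path:
    """Copy a .gguf file into models_root/publisher/model_name/ and return its new path."""
    if gguf_file.suffix != ".gguf":
        raise ValueError(f"expected a .gguf file, got {gguf_file.name}")
    target_dir = models_root / publisher / model_name
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(gguf_file, target_dir / gguf_file.name))


# Demo in a throwaway directory with a placeholder file instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "my-model-Q4_K_M.gguf"
    source.write_bytes(b"placeholder")
    placed = place_model(source, Path(tmp) / "models", "my-publisher", "my-model")
    print(placed.relative_to(Path(tmp) / "models"))
```

For a real import you would pass Path.home() / ".lmstudio" / "models" as models_root, assuming the default location shown above.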

Step 2 — Load the Model

Before the server can serve a model, the model must be loaded into memory.

Via the UI

Press Ctrl + L (or ⌘ + L) to open the model loader.
Select your downloaded model from the list.
LM Studio will auto-select load parameters optimized for your hardware (GPU offload, context size, etc.).
Wait for the progress bar to complete.

Via CLI

# List your downloaded models
lms ls

# Load a model by its identifier (use the key shown in lms ls)
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

GPU offloading: If you have a supported GPU (such as NVIDIA or Apple Silicon), LM Studio will offload layers to it automatically. In the UI sidebar, you can also drag the GPU Offload slider to max to force full GPU inference. When the whole model fits in GPU memory, this usually speeds up generation considerably.

Step 3 — Start the HTTP Server

This is the key step that turns LM Studio from a chat app into a backend.

Via the UI

Go to the Developer tab (the </> icon in the left sidebar).
Toggle "Start Server" to ON.
You'll see: Server running at http://localhost:1234

Via CLI

lms server start

To confirm it's running:

lms server status

The server listens on port 1234 by default. You can change this in the Developer tab settings, or from the CLI with lms server start --port <port>.
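From a script, a quick TCP probe tells you whether anything is accepting connections on the server port before you send real requests. A minimal sketch, assuming the default host and port; is_listening is an illustrative helper name. It only checks that the port is open, not that the process behind it is LM Studio.

```python
import socket


def is_listening(host: str = "localhost", port: int = 1234, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


print("server reachable:", is_listening())
```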

Step 4 — Verify the Endpoint

With curl

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Reply with: working."}
    ],
    "temperature": 0.1
  }'

Expected response shape:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "working."
    },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14 }
}
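If you would rather verify from Python without installing anything, the same request can be sent with the standard library. This is a sketch that assumes the response shape shown above; the function names are illustrative, and ask only works while the server is running with a model loaded.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the reply. Needs a running server."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

A call would look like ask("http://localhost:1234/v1", "<your model id>", "Reply with: working.").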

Check Which Models Are Loaded

curl http://localhost:1234/v1/models

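This returns a JSON list of the models the server can serve. The id field in each entry is what you pass as "model" in your API calls. Depending on your settings (for example, if just-in-time model loading is enabled), the list may also include downloaded models that are not loaded yet.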
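To pick up the ids programmatically, you can read the data array of the response. The sketch below assumes the OpenAI-style list format (an object with a data array of entries that each carry an id); the sample payload uses a made-up id, and list_model_ids needs a running server.

```python
import json
import urllib.request
from typing import List


def extract_model_ids(payload: dict) -> List[str]:
    """Return the id of every entry in a models-list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_ids(base_url: str = "http://localhost:1234/v1") -> List[str]:
    """Fetch the models list from a running server and return the ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return extract_model_ids(json.load(resp))


# Hand-written payload for illustration; the id is a placeholder.
sample = {"object": "list", "data": [{"id": "example-model", "object": "model"}]}
print(extract_model_ids(sample))
```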

Step 5 — Use It Like the OpenAI API

The endpoint is a drop-in replacement for most OpenAI chat client code. You need to change only two things:

base_url → http://localhost:1234/v1
api_key → any string (by default LM Studio doesn't validate it; "lm-studio" is the conventional placeholder)

Python Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

response = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 17 multiplied by 4?"}
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

Install the OpenAI SDK if you haven't:

pip install openai
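Because only the base URL and key differ, one way to keep the same code working against both OpenAI and LM Studio is to read those two values from the environment. A sketch, assuming hypothetical variable names LLM_BASE_URL and LLM_API_KEY (use whatever your project prefers), with the local server as the default:

```python
import os


def client_settings(env=None) -> dict:
    """Return base_url and api_key, defaulting to the local LM Studio server."""
    env = os.environ if env is None else env
    return {
        # LLM_BASE_URL and LLM_API_KEY are illustrative names, not a standard.
        "base_url": env.get("LLM_BASE_URL", "http://localhost:1234/v1"),
        "api_key": env.get("LLM_API_KEY", "lm-studio"),
    }


print(client_settings({}))
```

The result can be passed straight to the SDK with client = OpenAI(**client_settings()).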

Streaming Example

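# Reuses the client created in the Python example above.
stream = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)

for chunk in stream:
    # Some chunks can arrive without choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        # Print each fragment as soon as it arrives.
        print(delta.content, end="", flush=True)
print()  # finish with a newline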
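On the wire, a streamed response in the OpenAI format is a series of server-sent events: one "data: {json}" line per chunk, ending with "data: [DONE]". If you consume the stream without the SDK, parsing looks roughly like this. The sample lines are hand-written for illustration, not captured server output, and iter_stream_text is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "1, "}}]}',
    'data: {"choices": [{"delta": {"content": "2"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```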

What Endpoints Are Available

Endpoint	Description
`POST /v1/chat/completions`	Chat inference (OpenAI-compatible)
`GET /v1/models`	List loaded models
`POST /v1/completions`	Legacy text completion
`POST /v1/embeddings`	Embedding vectors
`POST /v1/responses`	OpenAI Responses API (stateful)
`POST /api/v1/chat`	LM Studio native v1 API (richer stats)

The /api/v1/* endpoints are LM Studio's native API (released in v0.4.0, so they need that version or later) and include enhanced stats like tokens/second and time-to-first-token. The /v1/* endpoints are the OpenAI-compatible layer; use these for maximum compatibility with existing tools.
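The other OpenAI-compatible endpoints follow the same request pattern as chat completions. As an illustration, here is a standard-library sketch for POST /v1/embeddings. It assumes the OpenAI request shape (model plus input) and response shape (a data array whose entries carry an embedding), and that you have an embedding model loaded; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, model: str, texts: List[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for the embeddings endpoint."""
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(payload: dict) -> List[List[float]]:
    """Return one embedding vector per input text, in response order."""
    return [item["embedding"] for item in payload["data"]]


def embed(base_url: str, model: str, texts: List[str]) -> List[List[float]]:
    """Send the request and return the vectors. Needs a running server."""
    request = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_vectors(json.load(resp))
```

A call would look like embed("http://localhost:1234/v1", "<your embedding model id>", ["hello", "world"]).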

Connecting to Other Tools

Since the endpoint is OpenAI-compatible, you can drop it into:

LangChain — set openai_api_base="http://localhost:1234/v1"
Open WebUI — add LM Studio as an OpenAI-compatible provider with the localhost URL
Cursor / Continue.dev — point the model provider at localhost:1234
Any app with a "custom OpenAI base URL" field should work the same way

Common Issues and Fixes

Model not appearing in /v1/models: The server is running, but no model is loaded. Load a model first (Step 2), then restart the server if needed.

"Connection refused" on port 1234: The server isn't started. Go to the Developer tab and toggle it on, or run lms server start.

Slow inference: GPU offload may not be active. In the model loader sidebar, slide GPU Offload to maximum. Requires an NVIDIA GPU with CUDA or Apple Silicon.

Model identifier mismatch: Use curl http://localhost:1234/v1/models to get the exact model id string, then use that verbatim in your API calls.
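To guard against identifier mismatches in your own scripts, you can resolve a loosely typed name against the ids the server reports before making a call. A sketch with a hypothetical resolve_model_id helper; the ids in the demo are placeholders, so feed it the list returned by your own server.

```python
from typing import List


def resolve_model_id(wanted: str, available: List[str]) -> str:
    """Map a loosely typed model name onto one exact id from the server's list."""
    if wanted in available:
        return wanted
    needle = wanted.lower()
    # Fall back to a case-insensitive substring match.
    matches = [model_id for model_id in available if needle in model_id.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no model matching {wanted!r}; available: {available}")
    raise LookupError(f"{wanted!r} is ambiguous; candidates: {matches}")


# Placeholder ids for illustration only.
ids = ["example-7b-instruct", "example-embedding-model"]
print(resolve_model_id("7B-Instruct", ids))
```

Failing loudly on zero or multiple matches is deliberate: silently picking the wrong model is harder to debug than an error.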

Debugging chat template issues

lms log stream

This streams raw prompts sent to the model — useful for verifying that your system prompt and message format are being applied correctly.

Quick Reference

# Download a model
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

# Load it
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Start the server
lms server start

# Verify
curl http://localhost:1234/v1/models

# Test inference
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "messages": [{"role": "user", "content": "ping"}]}'

That's the full loop: download → load → serve → call.
