Run Your Own OpenAI-Compatible API with LM Studio

Ashish Kapoor, Software Engineer · 7 min read

A practical guide to downloading GGUF models, loading them locally, and exposing an HTTP endpoint your code can actually talk to.

What You're Actually Building

By the end of this guide, you'll have:

  • A locally running LLM loaded in LM Studio
  • An HTTP server at http://localhost:1234 that speaks the OpenAI API dialect
  • A verified endpoint you can hit with curl, the openai Python SDK, or any tool that accepts a base_url

No cloud. No API key costs. No data leaving your machine.


Prerequisites

| Requirement | Why |
| --- | --- |
| LM Studio installed (v0.3.x or later) | Tested against current API surface |
| 8 GB RAM minimum (16 GB recommended) | Needed to load a 7B Q4 model comfortably |
| ~5–10 GB free disk space | For the model file |
| Python 3.8+ (optional) | For the verification step at the end |

Download LM Studio from lmstudio.ai. It's available for macOS, Windows, and Linux.

First-run requirement: Open the LM Studio GUI at least once before using the CLI (lms). This initializes the local config.


Step 1 — Download a GGUF Model

You have two paths: GUI or CLI. Both work. Pick one.

Path A: GUI Download

  1. Open LM Studio.
  2. Press Ctrl + Shift + M (Windows/Linux) or ⌘ + Shift + M (Mac) to open the model search.
  3. Type a model name — for example, qwen2.5-7b-instruct.
  4. LM Studio will show available quantizations and highlight the recommended one for your hardware (often Q4_K_M).
  5. Click Download.

You can also paste a full Hugging Face URL directly into the search bar. Example: https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-GGUF

Path B: CLI Download

# Download by Hugging Face repo name
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Specify a quantization with @
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

What's a Quantization Level?

GGUF files come in variants like Q4_K_M, Q5_K_S, Q8_0. The number is the approximate bits per weight: fewer bits mean a smaller file and lower memory use, at some cost in output quality. Rule of thumb:

| Quant | RAM footprint (7B model) | Use when |
| --- | --- | --- |
| Q4_K_M | ~4.5 GB | Standard choice: best quality/size tradeoff |
| Q5_K_M | ~5.5 GB | Slightly better quality, fits if you have headroom |
| Q8_0 | ~8 GB | Near-lossless, needs more VRAM/RAM |
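
The footprints in the table follow from simple arithmetic: parameter count times bits per weight, divided by eight. Here is a rough sketch of that estimate. The bits-per-weight values are ballpark assumptions rather than exact figures, Qwen2.5-7B is assumed to have about 7.6 billion parameters, and the function name is purely illustrative:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are ballpark assumptions; real files vary with model architecture.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GB (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


# Qwen2.5-7B is assumed to have roughly 7.6 billion parameters.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_weights_gb(7.6e9, quant):.1f} GB")
```

Actual memory use will be somewhat higher than the weights alone, because the context (KV cache) and runtime buffers also need room.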

Don't overthink this. Start with Q4_K_M.

Manual Import (If You Already Have a .gguf File)

LM Studio expects a specific directory structure. Place your file here:

~/.lmstudio/models/
└── publisher-name/
    └── model-name/
        └── model-file.gguf

Example:

~/.lmstudio/models/
└── lmstudio-community/
    └── Qwen2.5-7B-Instruct-GGUF/
        └── Qwen2.5-7B-Instruct-Q4_K_M.gguf
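
If you prefer to script the placement, the layout above can be reproduced with the standard library. This is a minimal sketch: the helper names are hypothetical, and the models root is passed in explicitly because it may differ on your machine:

```python
import shutil
from pathlib import Path


def model_destination(models_root: Path, publisher: str, model: str, gguf: Path) -> Path:
    """Return the expected location: <root>/<publisher>/<model>/<file>.gguf"""
    if gguf.suffix.lower() != ".gguf":
        raise ValueError(f"not a .gguf file: {gguf.name}")
    return models_root / publisher / model / gguf.name


def place_model(models_root: Path, publisher: str, model: str, gguf: Path) -> Path:
    """Copy a .gguf file into the publisher/model layout, creating folders as needed."""
    dest = model_destination(models_root, publisher, model, gguf)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(gguf, dest)
    return dest
```

For the example tree, the call would be place_model(Path.home() / ".lmstudio" / "models", "lmstudio-community", "Qwen2.5-7B-Instruct-GGUF", Path("/path/to/Qwen2.5-7B-Instruct-Q4_K_M.gguf")).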

Or use the CLI import command:

lms import /path/to/your/model-file.gguf

After placing files in the correct structure, the model will appear under My Models in the LM Studio UI.


Step 2 — Load the Model

Before the server can serve a model, the model must be loaded into memory.

Via the UI

  1. Press Ctrl + L (or ⌘ + L) to open the model loader.
  2. Select your downloaded model from the list.
  3. LM Studio will auto-select load parameters optimized for your hardware (GPU offload, context size, etc.).
  4. Wait for the progress bar to complete.

Via CLI

# List your downloaded models
lms ls

# Load a model by its identifier (use the key shown in lms ls)
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

GPU offloading: If you have an NVIDIA or Apple Silicon GPU, LM Studio will offload layers to it automatically. If the whole model fits in GPU memory, you can also drag the GPU Offload slider in the UI sidebar to max so that every layer runs on the GPU, which usually gives the fastest generation.


Step 3 — Start the HTTP Server

This is the key step that turns LM Studio from a chat app into a backend.

Via the UI

  1. Go to the Developer tab (the </> icon in the left sidebar).
  2. Toggle "Start Server" to ON.
  3. You'll see: Server running at http://localhost:1234

Via CLI

lms server start

To confirm it's running:

lms server status

The server listens on port 1234 by default. You can change this in the Developer tab settings.
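
From a script, a quick way to confirm that something is listening on the port is a plain TCP connection attempt. The sketch below uses only the standard library and a hypothetical helper name. It checks that the port accepts connections, not that the listener is actually LM Studio:

```python
import socket


def is_server_listening(host: str = "localhost", port: int = 1234, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


print("Server reachable on port 1234:", is_server_listening())
```

If you changed the port in the Developer tab, pass that value as the port argument.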


Step 4 — Verify the Endpoint

With curl

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    "messages": [
      {"role": "user", "content": "Reply with: working."}
    ],
    "temperature": 0.1
  }'

Expected response shape:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "working."
    },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14 }
}
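
The same check can be scripted without extra dependencies. Below is a minimal sketch using only the standard library. The helper names (build_chat_request, extract_reply, ask) are illustrative and not part of LM Studio or the OpenAI SDK, and the response is assumed to have the shape shown above:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running server and return the reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Offline check against the response shape shown above.
sample = {"choices": [{"message": {"role": "assistant", "content": "working."}}]}
print(extract_reply(sample))
```

With the server running, ask("http://localhost:1234/v1", "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "Reply with: working.") should return the model's reply as a string.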

Check Which Models Are Loaded

curl http://localhost:1234/v1/models

This returns a JSON list of currently loaded models. The id field in each entry is what you pass as "model" in your API calls.
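
To collect those ids in a script, something like the following sketch may help. It assumes the response follows the OpenAI list shape, with the entries under a top-level "data" key. The function names are hypothetical:

```python
import json
import urllib.request
from typing import List


def model_ids(payload: dict) -> List[str]:
    """Return the id of every entry in a /v1/models response."""
    # Assumes the OpenAI-style shape: {"object": "list", "data": [{"id": ...}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1", timeout: float = 5.0) -> List[str]:
    """Query a running server and return the model ids it reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Offline check with a hand-written payload in the assumed shape.
example = {
    "object": "list",
    "data": [{"id": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "object": "model"}],
}
print(model_ids(example))
```

An empty list from fetch_model_ids() suggests that the server is up but no model is loaded.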


Step 5 — Use It Like the OpenAI API

The endpoint is a drop-in replacement for most OpenAI client code. Besides setting model to an id that LM Studio serves, you only need to change two things:

  1. base_url → http://localhost:1234/v1
  2. api_key → any string (LM Studio doesn't validate it; "lm-studio" is the conventional placeholder)

Python Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

response = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 17 multiplied by 4?"}
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

Install the OpenAI SDK if you haven't:

pip install openai

Streaming Example

# Reuses the client created in the Python example above.
stream = client.chat.completions.create(
    model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)

# Print each text fragment as soon as it arrives.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
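
If you want the full text as a single string instead of printing fragments, the chunks can be accumulated. A small sketch follows. The function name collect_stream is illustrative, and the stand-in chunk objects only imitate the attributes the SDK's chunks expose:

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # the content can be None or empty on some chunks
            parts.append(text)
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in for an SDK chunk, used only for this offline demonstration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [_fake_chunk("1, "), _fake_chunk(None), _fake_chunk("2, 3")]
print(collect_stream(demo))  # prints "1, 2, 3"
```

With a live server, pass the stream object from the example above: collect_stream(stream).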

What Endpoints Are Available

| Endpoint | Description |
| --- | --- |
| POST /v1/chat/completions | Chat inference (OpenAI-compatible) |
| GET /v1/models | List loaded models |
| POST /v1/completions | Legacy text completion |
| POST /v1/embeddings | Embedding vectors |
| POST /v1/responses | OpenAI Responses API (stateful) |
| POST /api/v1/chat | LM Studio native v1 API (richer stats) |

The /api/v1/* endpoints are LM Studio's native API (released in v0.4.0, so they need that version or later) and include enhanced stats like tokens/second and time-to-first-token. The /v1/* endpoints are the OpenAI-compatible layer. Use these for maximum compatibility with existing tools.
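
As one example beyond chat, the embeddings endpoint from the table can be reached through the same SDK client. This sketch wraps the SDK's embeddings.create call. The wrapper name embed_texts is hypothetical, and it assumes that an embedding-capable model is loaded and that you pass its id:

```python
from typing import List, Sequence


def embed_texts(client, model: str, texts: Sequence[str]) -> List[List[float]]:
    """Return one embedding vector per input text via POST /v1/embeddings."""
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]
```

Here client is an OpenAI client pointed at http://localhost:1234/v1, and model is whatever id GET /v1/models reports for your embedding model.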


Connecting to Other Tools

Since the endpoint is OpenAI-compatible, you can drop it into:

  • LangChain — set openai_api_base="http://localhost:1234/v1"
  • Open WebUI — add LM Studio as an OpenAI-compatible provider with the localhost URL
  • Cursor / Continue.dev — point the model provider at localhost:1234
  • Any app with a "custom OpenAI base URL" field — it will work

Common Issues and Fixes

Model not appearing in /v1/models: The server is running, but no model is loaded. Load a model first (Step 2), then restart the server if needed.

"Connection refused" on port 1234: The server isn't started. Go to the Developer tab and toggle it on, or run lms server start.

Slow inference: GPU offload may not be active. In the model loader sidebar, slide GPU Offload to maximum. Requires an NVIDIA GPU with CUDA or Apple Silicon.

Model identifier mismatch: Use curl http://localhost:1234/v1/models to get the exact model id string, then use that verbatim in your API calls.
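
When the id in your code and the id the server reports differ only in case or in a publisher prefix, a small lookup can catch the mismatch early. This is one possible sketch, not an LM Studio feature. The function name and matching rules are assumptions chosen for illustration:

```python
from typing import List, Optional


def resolve_model_id(requested: str, available: List[str]) -> Optional[str]:
    """Find the served id that matches the requested name, or None if unclear."""
    if requested in available:
        return requested
    wanted = requested.lower()
    # Case-insensitive exact match first.
    for model_id in available:
        if model_id.lower() == wanted:
            return model_id
    # Then accept a substring match in either direction, but only if it is unique.
    candidates = [m for m in available if wanted in m.lower() or m.lower() in wanted]
    return candidates[0] if len(candidates) == 1 else None
```

Pass it the ids reported by /v1/models. A None result means there was no match or more than one, in which case copying the id by hand is the safer route.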

Debugging chat template issues

lms log stream

This streams raw prompts sent to the model — useful for verifying that your system prompt and message format are being applied correctly.


Quick Reference

# Download a model
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

# Load it
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Start the server
lms server start

# Verify
curl http://localhost:1234/v1/models

# Test inference
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "messages": [{"role": "user", "content": "ping"}]}'

That's the full loop: download → load → serve → call.