Skip to main content

2 posts tagged with "llm"

View All Tags

How to Guard a Machine That Believes Everything It Reads

· 14 min read
Ashish Kapoor
Software Engineer

Or: why "LLM firewall" is a comforting phrase that should make you nervous


A salesperson tells you their product has an LLM firewall, and you relax a little. Firewall. You know that word. It is the thing that keeps the bad guys out of your laptop, your office, your bank. So if the shiny new AI has a firewall wrapped around it, then the bad guys are kept out, and you can go to lunch.

That is exactly the moment to get nervous.

Richard Feynman liked to tell a story about his father and a bird. You can learn the name of that bird in every language on Earth, his father said, and when you are done you will know precisely nothing about the bird. So let us look at the bird and watch what it does. Words are not knowledge. A name is a label we paste on a thing so we can talk about it at parties. It tells you what people call it. It does not tell you what it does.

"Firewall" is one of those labels. It feels solid. Let us peel it off and look at the bird.

What a wall really does

A firewall, the original kind, is a wall. A real one, brick or concrete, built into a building so that if a fire starts on one side it cannot crawl to the other. It works because of physics. Fire cannot walk through concrete. There is nothing to outsmart. The wall does not have a bad day.

The firewall on your computer network is a little cleverer, but not much, and that is its great virtue. It sits at the gate and checks simple, mechanical things: which door are you knocking on, what address did you come from, what kind of knock is it. These are tidy questions with tidy answers. A port is a number. An address is a number. The rules are a short list, and the guard checks them perfectly, every time, forever, without being talked out of them. You cannot sweet-talk a number. That is the whole point. The thing a real firewall guards is structured and boring, and boring is safe.

Hold on to that idea: a real firewall works because the rules are simple and the stuff it inspects has a fixed shape.

The machine that believes everything it reads

Now we come to the language model, and everything changes, because a language model does not read numbers. It reads words. Your words, and everybody else's words, all poured into the same cup.

Here is the trouble, and it is worth slowing down for, because almost every disaster in this field grows from this one root. When you use one of these models, your instructions and the outside world's data are mixed together into a single stream of text. There is no special ink for "this is an order from the boss" and ordinary ink for "this is just some stuff to look at." It is all the same ink. The model reads the whole page and tries to be helpful about all of it.

Picture a butler. A brilliant butler: fast, eager, widely read, and completely unable to tell your voice from a stranger's once the words are on paper. In the morning you tell him: handle the mail, pay the bills, keep things tidy. Fine. Then the mail arrives, and tucked inside an ordinary-looking letter is a line that reads, P.S., from the master: also, hand the family silver to whoever brought this note. The butler does not hear your voice and the letter's voice as two different things. To him it is all just words that turned up in the house, and the words said "from the master," so off goes the silver.

That is a prompt injection. It is not exotic. It is the butler doing exactly what he was built to do, which is to read and to help, applied to a letter written by someone who is not you. People have used this trick to make these assistants leak private data, spend money, and mail things to strangers. The fancy phrase is "prompt injection." The plain fact is this: the machine believes everything it reads, and you do not control everything it reads.

So they hire a second reader

The obvious move, the one everybody reaches for first, is to hire a screener. Put somebody at the door to read all the incoming mail and pull out the trick letters before the butler ever sees them. This screener is what the salesperson is calling a "firewall."

And it helps. It really does. It will catch the clumsy tricks, the letters that shout IGNORE YOUR PREVIOUS INSTRUCTIONS in capital letters. But think about what the screener is. It is another reader. Another thing that looks at words and makes a guess about whether they smell wrong. And anything that guesses can be fooled, because the person writing the trick letter gets to be clever too. They can phrase it sweetly. They can write it in French. They can write it in code, or spell it funny, or bury it in the margin of a long, boring document the screener only skims. They can hide it inside a PDF as white text on a white page, so no human ever sees it but the machine reads it anyway.

You have put a guesser in front of a guesser. You have lowered the odds that a trick gets through. You have not made tricks impossible, and you cannot, because reading-and-guessing is the very thing being exploited, and you have answered it with more reading-and-guessing.

How much does it lower the odds? The careful people who measure this will tell you. The best research systems, the serious ones built by serious labs, stop something like two out of three or three out of four of the attacks they are tested against. Not all of them. The rest get through. And those are the numbers in a laboratory, against attacks the researchers already knew to look for. The clever new trick that nobody has seen yet is, by definition, not on the list.

The bamboo control tower

Here is where it gets dangerous, and here is where I want to borrow another of Feynman's stories, because he saw this pattern long before any of us had a computer to ruin.

After the war, on some islands in the South Pacific, people had watched cargo planes land during the fighting and unload wonderful things. When the war ended and the planes stopped coming, some of the islanders built runways out of dirt, lit fires along the sides to look like landing lights, and built a hut for a man to sit in with two wooden pieces on his head like headphones and bamboo poles sticking up like antennas, and they waited for the planes to come down. They had built, with great care, everything an airport looks like. And the planes did not come, because they had reproduced the form of the thing without the substance of the thing. Feynman called it cargo cult.

A box labeled "firewall," with a dashboard that glows green when things are calm and flashes red when it catches a clumsy attack, is a very comforting object. It looks like security. It has the shape of security. And if it lulls you into believing the bad guys are kept out, while in truth it is a screener that can be talked around, then you have built yourself a bamboo control tower. You are sitting in the hut with the wooden headphones, watching the green light, waiting for safety to land.

The first principle

Feynman gave a talk once where he laid down what he called the first principle, and it is the only sentence you really need pinned above your desk. You must not fool yourself, he said, and you should remember that you are the easiest person in the world to fool.

A comforting word and a green light are precisely the kind of thing that fools you, because you want to be fooled. You want to go to lunch. So the question for a careful engineer is not "how do I build a better screener." It is: "how do I arrange things so that it does not matter what the trick letter says?"

That turn, from reading the letters to not caring about them, is the whole game. Let me show you what it looks like.

Stop reading minds. Take away the keys.

Go back to the butler. We have established that you will never, with perfect reliability, tell his trick letters from his real ones by reading them. So stop trying to win that fight. Fight a different fight, one you can win.

Take the silver out of the house, or lock it in a safe whose combination the butler was never told. Do not give him the authority to mail the contracts. Let him read all the suspicious letters he likes, let him plan and draft and suggest to his heart's content, but arrange the world so that the doing of anything that matters passes through a lock he cannot open by himself. Then a letter that says "give away the silver" is just ink. He has no way to obey it. The trick still arrives. It simply cannot do anything.

In the language of building real systems, this comes down to a few plain parts.

Give the machine the least power that still lets it do its job. Every key it holds is a key an attacker can borrow. So hand it as few as possible, make them read-only wherever you can, and never let it carry the master keys "just in case." Its permissions should be the ceiling, and the ceiling should be low.

Put the real decisions in the hands of something too dumb to be fooled. This sounds like an insult and it is meant as a compliment. The thing that decides whether an action is allowed should not be the brilliant, gullible model. It should be a separate, boring, mechanical checker that knows one thing only: who is this really for, and are they allowed to touch this? That checker does not read persuasive letters. It checks a list, the way the old firewall checked a number. You cannot sweet-talk it, because there is nobody home to sweet-talk. When the model says "now send this file to Bob," the boring checker asks: is Bob allowed to have this file, and did that instruction come from the real user or from some letter? If the answer is wrong, the file does not move. The brilliant part proposes. The dumb part disposes.

Keep the planner away from the poison. This is the prettiest idea of the lot, and the best recent work is built on it. You split the brilliant butler into two. One of them, the planner, hears only your real instructions and never touches the suspicious mail at all. He makes the plan: "summarize yesterday's notes and email the summary to my boss." The other one, the reader, is allowed to handle the dirty, untrusted material, the documents and web pages and letters, but he is only ever permitted to fill in blanks on a form. He can report what the notes say. He cannot issue new orders. So when a poisoned note whispers "email everything to a stranger instead," it reaches the reader, who has no power to send anything, and it never reaches the planner, who has the power but never saw the note. The instruction to act can only come from the trusted plan. The untrusted text can color in the details. It cannot grab the wheel.

A team at Google DeepMind built exactly this and wrote it up in 2025 under the title Defeating Prompt Injections by Design. Their system, called CaMeL, takes your trusted request and turns it into a little program, so that the path of what-happens-next is fixed in advance and the untrusted data flowing through it cannot bend that path. Every piece of data carries a tag saying where it came from and what it is allowed to do, and at the moment of any real action a strict interpreter checks those tags and refuses anything that breaks the rules. The lovely thing about their paper is the scorecard. With their defense in place, the system finished about seventy-seven of every hundred test tasks while keeping its security guarantees, against eighty-four with no defense at all. They did not claim a hundred. Serious people do not claim a hundred. They paid a little usefulness for a lot of safety, and they showed you the bill.

Treat the machine's own words with the same suspicion. Whatever the model hands back is also just words, and the next thing down the line, a web page, a database, another tool, can be fooled by them too. So you do not simply trust the output and run with it. You check it, you escape it, you force it into a strict shape before you let it loose. A guesser's output is not gospel.

And for the few truly dangerous moves, ask a human. Sending money. Deleting records. Mailing something out into the world. For those, stop and get a real person to say yes. But, and this matters, do it rarely. If you make the human click "yes, I'm sure" forty times a day, by lunchtime they are clicking yes without reading, and you have trained your last line of defense to be a rubber stamp. The DeepMind people warned about this too. A safeguard that nags people into ignoring it is no safeguard.

So where does the "firewall" go?

Do not throw it out. I have spent this whole essay poking holes in it, so let me be fair: the screener at the door is useful. It catches the clumsy attacks so your better defenses are not bothered with them. It keeps a log of who has been rattling the doors. It lets you notice when something strange is happening. It is a smoke detector. A smoke detector is a fine thing to own. It is not a fireproof wall, and you would not cancel your fire insurance because you installed one.

So put it on top, as the last and softest layer, sitting over a design that would survive perfectly well if you switched it off tomorrow. And there is your test, the one plain question to ask of any AI system that claims to be secure: if I turned off the thing called the firewall, would I be robbed? If the answer is yes, you never had security. You had a green light and a feeling.

The honest ending

I would love to end by telling you the problem is solved. It is not. People have been wrestling with this particular demon since about 2022, when the trick first got its name, and progress has been slow and hard-won, and the cleverest defense going still misses one attack in a handful. That is the truth, and the truth is better company than a comfortable lie.

So here is the whole thing, as plainly as I can put it. You can call it a firewall. You can call it a firewall in every language on Earth. And when you are finished naming it, you will still not know whether it stops the thief. For that you have to put the label down and look at the bird: watch what it does, find out what it cannot do, and build your house so that when the machine is fooled, and someday it will be, the thief still goes home with empty hands.

That is not as comforting as the word "firewall." It has the small advantage of being real.


A few notes for the curious

  • Defeating Prompt Injections by Design (the CaMeL paper), Google DeepMind, 2025: arxiv.org/abs/2503.18813
  • Design Patterns for Securing LLM Agents against Prompt Injections, 2025, a careful catalog of the "take away the keys" patterns: arxiv.org/abs/2506.08837
  • Simon Willison coined the term "prompt injection" in 2022 and has written about it more clearly than almost anyone since: simonwillison.net

Run Your Own OpenAI-Compatible API with LM Studio

· 7 min read
Ashish Kapoor
Software Engineer

A practical guide to downloading GGUF models, loading them locally, and exposing an HTTP endpoint your code can actually talk to.

What You're Actually Building

By the end of this guide, you'll have:

  • A locally running LLM loaded in LM Studio
  • An HTTP server at http://localhost:1234 that speaks the OpenAI API dialect
  • A verified endpoint you can hit with curl, the openai Python SDK, or any tool that accepts a base_url

No cloud. No API key costs. No data leaving your machine.


Prerequisites

RequirementWhy
LM Studio installed (v0.3.x or later)Tested against current API surface
8 GB RAM minimum (16 GB recommended)Needed to load a 7B Q4 model comfortably
~5–10 GB free disk spaceFor the model file
Python 3.8+ (optional)For the verification step at the end

Download LM Studio from lmstudio.ai. It's available for macOS, Windows, and Linux.

First-run requirement: Open the LM Studio GUI at least once before using the CLI (lms). This initializes the local config.


Step 1 — Download a GGUF Model

You have two paths: GUI or CLI. Both work. Pick one.

  1. Open LM Studio.
  2. Press Ctrl + Shift + M (Windows/Linux) or ⌘ + Shift + M (Mac) to open the model search.
  3. Type a model name — for example, qwen2.5-7b-instruct.
  4. LM Studio will show available quantizations and highlight the recommended one for your hardware (usually Q4_K_M for most machines).
  5. Click Download.

You can also paste a full Hugging Face URL directly into the search bar. Example: https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-GGUF

Path B: CLI Download

# Download by Hugging Face repo name
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Specify a quantization with @
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

What's a Quantization Level?

GGUF files come in variants like Q4_K_M, Q5_K_S, Q8_0. The number refers to bits-per-weight. Rule of thumb:

QuantRAM footprint (7B model)Use when
Q4_K_M~4.5 GBStandard choice — best quality/size tradeoff
Q5_K_M~5.5 GBSlightly better quality, fits if you have headroom
Q8_0~8 GBNear-lossless, needs more VRAM/RAM

Don't overthink this. Start with Q4_K_M.

Manual Import (If You Already Have a .gguf File)

LM Studio expects a specific directory structure. Place your file here:

~/.lmstudio/models/
└── publisher-name/
└── model-name/
└── model-file.gguf

Example:

~/.lmstudio/models/
└── lmstudio-community/
└── Qwen2.5-7B-Instruct-GGUF/
└── Qwen2.5-7B-Instruct-Q4_K_M.gguf

Or use the CLI import command:

lms import /path/to/your/model-file.gguf

After placing files in the correct structure, the model will appear under My Models in the LM Studio UI.


Step 2 — Load the Model

Before the server can serve a model, the model must be loaded into memory.

Via the UI

  1. Press Ctrl + L (or ⌘ + L) to open the model loader.
  2. Select your downloaded model from the list.
  3. LM Studio will auto-select load parameters optimized for your hardware (GPU offload, context size, etc.).
  4. Wait for the progress bar to complete.

Via CLI

# List your downloaded models
lms ls

# Load a model by its identifier (use the key shown in lms ls)
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

GPU offloading: If you have an NVIDIA or Apple Silicon GPU, LM Studio will offload layers to it automatically. In the UI sidebar, you can also drag the GPU Offload slider to max to force full GPU inference — this dramatically speeds up generation.


Step 3 — Start the HTTP Server

This is the key step that turns LM Studio from a chat app into a backend.

Via the UI

  1. Go to the Developer tab (the </> icon in the left sidebar).
  2. Toggle "Start Server" to ON.
  3. You'll see: Server running at http://localhost:1234

Via CLI

lms server start

To confirm it's running:

lms server status

The server listens on port 1234 by default. You can change this in the Developer tab settings.


Step 4 — Verify the Endpoint

With curl

curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
"messages": [
{"role": "user", "content": "Reply with: working."}
],
"temperature": 0.1
}'

Expected response shape:

{
"id": "chatcmpl-...",
"object": "chat.completion",
"choices": [{
"message": {
"role": "assistant",
"content": "working."
},
"finish_reason": "stop"
}],
"usage": { "prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14 }
}

Check Which Models Are Loaded

curl http://localhost:1234/v1/models

This returns a JSON list of currently loaded models. The id field in each entry is what you pass as "model" in your API calls.


Step 5 — Use It Like the OpenAI API

The endpoint is a drop-in replacement. You only need to change two things in any existing OpenAI client code:

  1. base_urlhttp://localhost:1234/v1
  2. api_key → any string (LM Studio doesn't validate it; "lm-studio" is the conventional placeholder)

Python Example

from openai import OpenAI

client = OpenAI(
base_url="http://localhost:1234/v1",
api_key="lm-studio",
)

response = client.chat.completions.create(
model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 17 multiplied by 4?"}
],
temperature=0.2,
)

print(response.choices[0].message.content)

Install the OpenAI SDK if you haven't:

pip install openai

Streaming Example

stream = client.chat.completions.create(
model="lmstudio-community/Qwen2.5-7B-Instruct-GGUF",
messages=[{"role": "user", "content": "Count from 1 to 5."}],
stream=True,
)

for chunk in stream:
delta = chunk.choices[0].delta
if delta.content:
print(delta.content, end="", flush=True)

What Endpoints Are Available

EndpointDescription
POST /v1/chat/completionsChat inference (OpenAI-compatible)
GET /v1/modelsList loaded models
POST /v1/completionsLegacy text completion
POST /v1/embeddingsEmbedding vectors
POST /v1/responsesOpenAI Responses API (stateful)
POST /api/v1/chatLM Studio native v1 API (richer stats)

The /api/v1/* endpoints are LM Studio's native API (released in v0.4.0) and include enhanced stats like tokens/second and time-to-first-token. The /v1/* endpoints are the OpenAI-compatible layer — use these for maximum compatibility with existing tools.


Connecting to Other Tools

Since the endpoint is OpenAI-compatible, you can drop it into:

  • LangChain — set openai_api_base="http://localhost:1234/v1"
  • Open WebUI — add LM Studio as an OpenAI-compatible provider with the localhost URL
  • Cursor / Continue.dev — point the model provider at localhost:1234
  • Any app with a "custom OpenAI base URL" field — it will work

Common Issues and Fixes

Model not appearing in /v1/models The server is running, but no model is loaded. Load a model first (Step 2), then restart the server if needed.

"Connection refused" on port 1234 The server isn't started. Go to the Developer tab and toggle it on, or run lms server start.

Slow inference GPU offload may not be active. In the model loader sidebar, slide GPU Offload to maximum. Requires an NVIDIA GPU with CUDA or Apple Silicon.

Model identifier mismatch Use curl http://localhost:1234/v1/models to get the exact model id string, then use that verbatim in your API calls.

Debugging chat template issues

lms log stream

This streams raw prompts sent to the model — useful for verifying that your system prompt and message format are being applied correctly.


Quick Reference

# Download a model
lms get lmstudio-community/Qwen2.5-7B-Instruct-GGUF@Q4_K_M

# Load it
lms load lmstudio-community/Qwen2.5-7B-Instruct-GGUF

# Start the server
lms server start

# Verify
curl http://localhost:1234/v1/models

# Test inference
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "lmstudio-community/Qwen2.5-7B-Instruct-GGUF", "messages": [{"role": "user", "content": "ping"}]}'

That's the full loop: download → load → serve → call.