Your Mac Is a Model Server
(Here's how to treat it like one.)
You know that moment on a plane where you open your laptop, pull up your coding agent, and remember that it lives on someone else’s server? United’s wifi is doing its thing — which is to say, nothing — and you’re sitting there with an M4 Max that could run a 35-billion-parameter model but can’t complete a function call. This note is about fixing that, and a few hundred other decisions like it.
There are guides for running open-weight models locally. What’s missing is which model, which agent, and why. So, that’s what this is. By the end of this, you’ll have a working Claude Code equivalent running (kinda) entirely on your own hardware.
What You're Building
Three pieces: a local inference server (llama.cpp) that speaks OpenAI-compatible HTTP, a model you’ve chosen consciously with a license you’ve actually read, and an agent that routes tasks to it.
The agent I’m replacing Claude Code with is OpenCode — same terminal TUI feel, same agentic loop, reads your repo, proposes changes, executes tools, commits. If you’d rather stay in your editor, VS Code + Continue gets you something close to Cursor pointed at your local server. This guide focuses on OpenCode. The rest of the setup is identical either way.
Cursor bills me every time I look at it. This runs on hardware already sitting on my lap, and the marginal cost of the next token is zero.
Step 1: Install the Runtime
brew install llama.cpp fzf hf

llama.cpp is the inference server; Metal GPU support is on by default on Apple Silicon. fzf is needed by OpenCode for fuzzy search. hf is the official Hugging Face CLI and the right way to download models.
Confirm it worked:
llama-server --version

Step 2: Pick and Download Your Model
The benchmark tables are mostly useless for this decision. Three questions matter: Does it fit in your RAM, does it call tools reliably, and can you actually build on the license?
Start with how much RAM you have:
system_profiler SPHardwareDataType | grep -E "Chip|Memory"

16GB
→ Model: Qwen2.5-Coder-7B-Instruct
→ License: Apache 2.0
→ Size: ~5GB
→ Fast, reliable, great starting point
32–48GB
→ Model: gpt-oss-20b
→ License: MIT
→ Size: ~12GB
→ Best for tool use + agent workflows
64GB+
→ Model: Qwen3.5-35B-A3B
→ License: Apache 2.0
→ Size: ~22GB
→ More powerful, MoE architecture
Download commands:
# 16GB
hf download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--local-dir ~/models
# 32-48GB
hf download ggml-org/gpt-oss-20b-GGUF \
gpt-oss-20b-mxfp4.gguf \
--local-dir ~/models
# 64GB+
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
Qwen3.5-35B-A3B-Q4_K_M.gguf \
--local-dir ~/models

These downloads are large (5–22GB). Start one; go get coffee.
The 16GB and 64GB+ models are Q4_K_M: 4-bit quantization, the standard quality/size tradeoff. You lose a small amount of quality and gain a large reduction in size and memory; for most coding tasks, you won’t notice. The 32–48GB model is different: mxfp4 is the format gpt-oss was actually trained in, so there’s nothing to lose.
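If you’re weighing other quants (Q5_K_M, Q8_0), a back-of-the-envelope size estimate is just parameters times effective bits per weight. A rough sketch; the bits-per-weight figures and the 10% overhead for embeddings and metadata are my approximations, not exact GGUF numbers:

```python
# Rough GGUF size estimate: params * effective bits per weight / 8,
# plus ~10% overhead for embeddings, norms, and metadata.
# Bits-per-weight values are approximate averages, not spec numbers.
EFFECTIVE_BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def estimated_size_gb(params_billions: float, quant: str) -> float:
    bits = EFFECTIVE_BITS[quant]
    return round(params_billions * bits / 8 * 1.1, 1)

for quant in ("Q4_K_M", "Q8_0"):
    print(quant, estimated_size_gb(35, quant), "GB")
```

For a 35B model, Q4_K_M lands around 22GB, which is why that tier needs 64GB of RAM once you add the KV cache and everything else on your machine.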
All three models in this issue are Apache 2.0 or MIT — you can build on them commercially without restrictions. (That’s not true of every popular model.) Before you add anything else to your stack, read the license file in the repo, not the marketing copy on the model card.
Case in point: yesterday, Alibaba released Qwen3.5-Omni — their new multimodal model — as API-only. No weights, no license file, no download link. The previous version was fully open. The researcher most associated with Alibaba’s open-source work left the company earlier this month. None of this changes what’s in your ~/models directory right now — those weights aren’t going anywhere. But it’s the kind of signal that makes me want to be better at evaluating alternatives, not worse. Next issue: how to actually do that with Kimi and DeepSeek.
Step 3: Start the Server
llama-server \
-m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
--jinja \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 32768 \
-ngl 99

Two flags worth explaining:
--jinja enables tool calling. Skip it and your agent will appear to work but silently refuse to call tools; this flag is responsible for most of the “it won’t use tools” failures I’ve seen. -ngl 99 offloads all model layers to the Metal GPU. Skip it on Apple Silicon and you’re running on CPU, roughly ten times slower than it should be.
Wait for this line before doing anything else:
main: server is listening on http://127.0.0.1:8080

Sanity check:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "say hi in 5 words"}],
    "max_tokens": 20
  }'

If you get a response, your stack is working. The first time I ran this I immediately opened Activity Monitor to watch the GPU. Everything after this is just routing clients at it.
Step 4: The Agent - OpenCode
My friends at anoma.ly built OpenCode as the open-source Claude Code equivalent. Same terminal TUI feel, same agentic loop. The difference: You’re not locked to Anthropic’s models, pricing, or terms.
(Timing note: As I was literally about to hit send, Anthropic accidentally shipped Claude Code’s entire source — 512,000 lines of TypeScript — in a .map file on npm. Nobody hacked anything. The front door was open. Within hours, a developer named Sigrid Jin used a different coding agent — OpenAI’s Codex — to rewrite the whole harness in Python before sunrise. The repo hit 48,000 stars in a day. A coding agent cloned a coding agent in one session. It’s days old, legally radioactive, and the author ships a parity_audit.py that’s refreshingly honest about what’s missing. I am not recommending you use it. I am noting that the universe has a sense of humor about the rent-vs.-own conversation.)
I did brew install opencode first. It installed, but it didn’t work. Thirty minutes later, I found the tap.
brew install anomalyco/tap/opencode

Pre-install the provider package. OpenCode downloads this at startup, but it hangs silently if npm is slow. Do it manually once and you’ll never hit this:
mkdir -p ~/.cache/opencode
cd ~/.cache/opencode
npm install @ai-sdk/openai-compatible

Now, create two config files, nothing else. In ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"model": "llamacpp/qwen3.5-35b",
"provider": {
"llamacpp": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama.cpp (local)",
"options": {
"baseURL": "http://127.0.0.1:8080/v1"
},
"models": {
"qwen3.5-35b": {
"name": "Qwen3.5-35B-A3B",
"tools": true,
"limit": {
"context": 32768,
"output": 8192
}
}
}
}
}
}

and in ~/.local/share/opencode/auth.json:
{
"llamacpp": {
"apiKey": "sk-local"
}
}

Launch (server must be running first in a separate terminal):
mkdir ~/test-project && cd ~/test-project
opencode

OpenCode will ask if you want to configure providers. Anthropic, OpenAI, a dozen others. You can skip every single one. You don’t have a remote platform. You don’t need a key. That moment — closing out of a credentials prompt with nothing in it — is the whole point of this newsletter in one keystroke.
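If OpenCode hangs at startup or the provider never shows up, a typo in opencode.json is the usual culprit. A quick structural check catches most of them. This sketch embeds the config from above as a dict for illustration; in practice you’d json.load the real file from ~/.config/opencode/opencode.json:

```python
# The opencode.json from above, embedded for illustration.
config = {
    "$schema": "https://opencode.ai/config.json",
    "model": "llamacpp/qwen3.5-35b",
    "provider": {
        "llamacpp": {
            "npm": "@ai-sdk/openai-compatible",
            "options": {"baseURL": "http://127.0.0.1:8080/v1"},
            "models": {"qwen3.5-35b": {"tools": True}},
        }
    },
}

def check(config: dict) -> list[str]:
    """Return a list of problems; empty means the basics look right."""
    problems = []
    provider_id, model_id = config.get("model", "unknown/unknown").split("/", 1)
    provider = config.get("provider", {}).get(provider_id)
    if provider is None:
        problems.append(f"default model points at unknown provider {provider_id!r}")
    elif model_id not in provider.get("models", {}):
        problems.append(f"provider {provider_id!r} has no model {model_id!r}")
    elif not provider["models"][model_id].get("tools"):
        problems.append("tools is not enabled; the agent loop needs it")
    if provider and not provider.get("options", {}).get("baseURL", "").endswith("/v1"):
        problems.append("baseURL should point at the server's /v1 endpoint")
    return problems

print(check(config))  # [] when everything lines up
```

The one that bites people most is the default model string: the part before the slash has to match a provider key exactly, and the part after has to match a key under that provider’s models.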
Want proof that it works? Type this in the input box:
write a python script that prints 'hello world'

It does. Obviously. That’s not proof of anything except that your wiring is correct. Let’s make it actually work for its dinner.
create a tetris clone in typescript that can be served statically out of a dist/ directory

OpenCode maps out a plan, scaffolds the project, writes the code. I open the browser and…
Uncaught SyntaxError: unexpected token: 'as' app.js:2:20

One error. I pulled it from the browser console, pasted it back into OpenCode, and it fixed it. Served it again. Played Tetris — on a model running on my laptop, with no API call, no subscription, no telemetry.
I should be honest about what that means. If I’d done this in Claude Code with Opus 4.6, it probably gets it right on the first try. I know it would in Cursor — I’ve run this exact test enough times to be bored by it. A local 35B model making one fixable mistake on a from-scratch Tetris clone is genuinely good. It is not frontier-model good. That’s the tradeoff you’re signing up for, and you should know it going in.
That’s the whole stack: local model, Metal GPU inference, OpenAI-compatible API, open-source agent, tool-calling. Nothing left my machine.
What You’re Actually Getting Into
The tooling is fast-moving and occasionally broken. OpenCode ships near-daily updates. The standard Homebrew formula lags. The startup hang — silently trying to download a provider package with no progress indicator — is a known issue with a simple manual workaround, which is why the setup above pre-installs it. This will improve. Worth knowing what you’re getting into.
Tool calling is fragile. The model has to be trained for it and templated correctly. --jinja is the most common fix. Second most common: The model you chose wasn’t trained for tool use. Qwen3.5 and gpt-oss-20b are; not all models are.
Context size costs RAM. The KV cache is the model’s working memory for a conversation — it stores computed state for every token in context, across every layer of the model. It grows linearly with context length, and at long contexts can rival the model weights themselves in memory. Start at 32k. Increase only if you need it and have confirmed headroom.
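To put numbers on that: KV-cache size is 2 (keys and values) times layers times KV heads times head dimension times context length times bytes per element. A sketch with illustrative architecture numbers; the layer and head counts below are stand-ins, not the published Qwen3.5-35B-A3B config, so check your model’s metadata before trusting the output:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: 2 tensors (K and V) per layer, fp16 by default."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / 2**30

# Illustrative GQA config (NOT the exact Qwen3.5 numbers): 48 layers,
# 8 KV heads, head_dim 128. Growth is linear in context length.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(48, 8, 128, ctx):.1f} GiB")
```

At 32k context, this illustrative config already wants about 6 GiB on top of the model weights, and quadrupling the context quadruples that. That’s the headroom you’re checking for before raising --ctx-size.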
It’s faster than you might expect. Running Qwen3.5-35B on an M4 Max, generation lands around 15–20 tokens per second, which is comfortable reading speed. You are not waiting on this model. Where you will feel it is autocomplete: round-trip latency for inline suggestions is longer than a hosted API’s, and for that use case it’s a real regression. For agent tasks where you kick something off and come back, it’s a non-issue.
It works on a plane. No wifi, no API, no problem. I’ve done more useful work in airplane mode with a local model than I ever did frantically tethering before descent. Higher latency is a real tradeoff. Offline is a real feature. Decide which one you’re optimizing for on any given day.
The Decision Framework: What Stays Local, What Goes Cloud
Local wins on:
Proprietary code you shouldn’t be sending to an external API
Repetitive, high-volume tasks where cost is the actual constraint
Anything where predictable latency matters more than raw speed
Anywhere without reliable internet: planes, trains, a cabin, an SCIF
Cloud still wins on:
Tasks that genuinely require frontier capability: complex, multi-step reasoning, novel architecture work
Very long context where you don’t have the RAM to compete locally
One-off tasks where setup time exceeds cost savings
The useful test: could a competent senior engineer do this task well? If yes, a good local model probably can, too. If the task requires something closer to “principal engineer with five years of context on your entire codebase,” you might still want the cloud.
The goal isn’t running everything locally. The goal is making the decision on purpose.
Now tell me about yours. What model are you running? What’s your setup? I’m particularly curious about the Kimi and DeepSeek models — next issue, let’s talk about how to actually evaluate and choose between them.
What I’m Reading This Week
The new SVG engine Heerich isn’t open source AI, but it is open source. And gorgeous.
Everyone’s arguing about which LLM to run locally. Meanwhile, LeCun’s lab shipped a world model — a system that doesn’t predict the next token, it predicts what happens next in a latent representation of the physical world. That distinction matters: token prediction gives you autocomplete; state prediction gives you planning, reasoning, and systems that can actually act on things. LeWorldModel does it in 15M params, one GPU, two loss terms, no pretrained encoder, 48x faster planning than foundation-model approaches. The JEPA bet is that you don’t need to reconstruct every pixel to understand cause and effect — just the compressed state that matters. Code’s on GitHub.
I think the surveillance world we’re creating is generally creepy. (Part of the wonder of human memory is the ability to forget, to get fuzzy.) But I love this trend of people vibe-coding tools they want for themselves, like this app that makes your intent and decisions searchable by your AI. Of course, this is also the end of every SaaS company on the planet…
After virtual-world interaction, the next thing we all have to think about is robots and physical-world interaction. Maybe I should give these plans to my son to print for his robot hand…
I haven’t had a chance to play with this yet, but I’m definitely intrigued. The bare-metal inference engine runs a ~400B parameter MoE model on a laptop (10x larger than the ones I’m talking about above.) If it works, frontier models don’t have to live in the cloud, trading latency for independence.
Digging into this paper on optimizing the harness w.r.t. end performance.
Here’s the scientifically backed version of a piece I wrote about sycophancy. Stanford researchers found that AI reinforces users’ harmful prompts 47% more than humans. Turns out humans are more than okay with that.



