Owners Not Renters

The Rent Is Due

Raffi Krikorian — Fri, 29 May 2026 14:58:15 GMT

Zapier asked 542 US executives this month if they could swap AI vendors in under four weeks. 89% said yes; 41% said two to five business days. Then Zapier asked the ones who’d actually tried. 58% said the migration failed outright or burned far more time and money than budgeted. Most executives think they can switch whenever. Most of the ones who’ve tried wish they’d moved sooner. Lock-in is real, it’s hard to undo, and the window to get open in is before lock-in arrives — not after.

The pricing just moved the other way too. OpenAI rolled out GPT-5.5 at $5 per million input tokens and $30 per million output — double what GPT-5.4 costs. (A million tokens is roughly an 800-page novel.) Anthropic adjusted Claude Enterprise billing on April 15 so Opus 4.7’s higher inference costs flow through to customers; heavy users report 2-to-3× bills. GitHub paused new Copilot signups, capped existing plans, and removed Opus from the Pro tier. The loss-leader years end with a vendor sending you a bill. Lock-in is what makes you pay it. The cheapest moment to leave was before the bill arrived. The next-cheapest is the next thing you read in this letter.

And just yesterday: Opus 4.8. Same price as 4.7, but the benchmarks moved: 69.2% on SWE-Bench Pro, up from 64.3% for 4.7 and 58.6% for GPT-5.5. A new Fast mode runs 2.5× faster at one-third the cost. The early testimonials are raving: Opus 4.8 “proactively flags issues with the inputs and outputs of an analysis, something other models routinely missed and left to the users to catch.” I’m running it through my Morph and Tap harnesses this weekend, against real session traces from my own coding work. Personal evals incoming. The frontier got better and cheaper on the same day the vendor bills doubled. That is the rent landscape in May 2026: moving in two directions at once.

The cheapest seat in the AI house this month comes with the same lock-in problem, except the landlord is a thief. Chinese students are paying 3 to 4 percent of the list price for GPT and Claude via resellers on the Chinese consumer marketplaces Xianyu and Taobao. It isn’t arbitrage. Per reporting on Oxford researcher Zilan Qian’s investigation, it’s a gray-market supply chain. Upstream operators bulk-register Anthropic and OpenAI accounts using stolen credentials and free-credit farming. The “transfer stations” (中转站, in Chinese developer slang) route traffic through their own gateway servers, often with silent model substitution: you pay for Opus, you get whatever the reseller swaps in. Every prompt and output is logged and resold downstream as training data. Anthropic identified roughly 24,000 fraudulent accounts in February tied to DeepSeek, Moonshot, and MiniMax — three of China’s biggest AI labs. The White House called it industrial-scale distillation: training Chinese models on the outputs of American ones, at the cost of whoever’s typing. Every line of code routed through a cheap proxy is teaching the model that wants your day job. The good news: the legitimate cheap option also shipped this week. Apache 2.0, free weights, no resellers.

Leave a comment

Cohere released Command A+ under Apache 2.0, the most permissive open-source license there is. 218 billion parameters total, weights free on Hugging Face. 128K context, 48 languages, vision and tool use unified. Runs on two H100s or a single Blackwell — the Nvidia datacenter chips that power most of the cloud’s AI. The benchmarks show gaps (it ranks below the frontier on the hardest agentic coding), but the license is the news. Take it. Run it. Modify it. Sell what you build on it. You don’t need permission. You don’t need a renewal. This is Cohere’s first frontier-class model that anyone can deploy commercially without asking. Co-founder Nick Frosst framed it as sovereign critical infrastructure: government and regulated industry running the model in their own data centers, fully cut off from the internet if needed.

The Register ran a Gartner note this month arguing “sovereign cloud is only possible if you’re Chinese or American.” But… a Canadian frontier-class model under Apache 2.0, running on customer infrastructure, is the rebuttal. If “evaluate an open model” has been on your roadmap for two years, this is the week you cross it off.

OpenAI’s Privacy Filter is the smaller release with the sharper architectural message. A 1.5B-parameter Apache 2.0 model, 50M parameters active per token, that runs entirely in a browser tab on your own GPU. You can open your browser’s network tab and confirm nothing leaves the device. This is the exact opposite of the pattern I called local-washing — a 4GB Gemini Nano model that Chrome silently writes to disk while the “AI Mode” pill still routes every query to Google’s cloud. Local model present, the user-visible feature still rented. Privacy Filter inverts that: mask names, addresses, emails, account numbers, and secrets on your device first, then send the cleaned-up prompt to whatever frontier model you want. The architecture matches the claim. There is no longer an excuse to send raw PII to a cloud model. Bolt it in front of your existing stack this sprint — 1.5 billion parameters, Apache 2.0, an afternoon of work.

Two pieces of the local stack worth your weekend. Ahmad Osman’s “Inference Engines for LLMs & Local AI Hardware (2026 Edition)” is what local inference reads like when an infra engineer writes it for other infra engineers. The framing is right: you don’t pick the engine first. You pick the hardware, the workload, and how many people will use it at once. Generating text one token at a time is bottlenecked by memory speed, not raw compute — which is why an M4 Max at 546 GB/s of memory bandwidth competes with much pricier datacenter GPUs on single-user inference, and why the right engine (llama.cpp, vLLM, SGLang, TensorRT-LLM, for the people taking notes) falls out of that one constraint.

On the consumer end, Atomic Chat ships Qwen3.6-35B locally on M-series Macs at 60+ tokens per second — output streaming faster than you can read it, on battery, from 22 GB of model weights sitting in your laptop’s memory. Google’s TurboQuant squeezes the model’s working memory down to 3.5 bits per number with no measurable quality loss, even at 65,000 tokens of context. A Claude API call streams back at about the same 60 tokens per second. Same speed. No meter, no outbound packets. I built a working Canvas physics game in a weekend — parallax scrolling, collision detection, no API key. Six months ago that was a cloud workload. Now it runs on your laptop, at cloud speeds, for $0. The cloud isn’t where AI lives anymore. It’s just where some of it happens to live.

Forty authors. Forty institutions. One vocabulary. A survey titled “Code as Agent Harness” landed this month. The thesis has been forming for months: code stopped being just what AI agents write and became how they work — the substrate the agent operates inside, not the output it produces. The four properties the authors name — executable, inspectable, stateful, governed — are the language auditors and security officers will use in 2027 procurement. Section 5.2’s seven open problems are functionally the procurement checklist. If you’re building an agent and you can’t say yes to all four, you don’t have a product. You have a demo.

A companion paper landed the week before. Chopra et al. on “Beyond Cooperative Simulators” argues that the AI-generated stand-in users most labs test their agents against inherit their base model’s behavior — patient, cooperative, willing to clarify — and so agent evals systematically overstate real-world performance. Real users are unclear, impatient, and reluctant to repeat themselves. The Opus 4.8 release that just happened — the one that “flags issues with the inputs and outputs” — is the first model that the labs have shipped that’s been trained against exactly this gap. The papers describe the problem. The models are starting to fix it. The harness is where the fix lives. Run your stack against the four properties this week. Anything that fails is what your 2027 procurement gets stuck on. Fix it now while it’s still cheap to fix.

The wrong reflex this month came from a place I love. NHS England ordered tech leadership to set every public repository to private by May 11. The stated reason is that Anthropic’s Mythos — a new model trained to find software vulnerabilities — is good at finding them in code it can read, so the code should not be readable. The unstated reason is that someone in senior leadership panicked.

The repos contain datasets, internal tools, front-end resources, and most of the work the team that shipped the NHS COVID-19 app produced — code that taxpayers paid for twice over, and that other public-health bodies were reading, forking, and reusing. Roughly 200 repositories went private before the backlash arrived. The code is also already on the open internet — on archived mirrors, on forks, and in every major dataset that scraped GitHub between 2021 and last week. Closing the repos doesn’t unship the bytes. It makes the maintained version harder to find. It tells the engineers who built the app on principle that their employer has stopped believing the principle. I argued in the Times earlier this year that the world’s most valuable software infrastructure is maintained by people working for free, while the companies building fortunes on top of it never paid for the upkeep. The NHS was one of the rare institutions actually funding that maintenance, on principle. Now it isn’t. And it gives every other public-sector CIO permission to make the same call. UK GDS published guidance on May 14 contradicting the NHS position directly: “You should never close an open repository.” If your security depends on attackers not reading the code, you never had security. You had time. And the clock just sped up.

If you’re the engineer reading this at a public-sector or regulated employer: the fix is the opposite of the reflex. Open more. Document more. Fund the maintenance. The GDS guidance is the playbook. The Cohere release proves the model exists. The only thing missing is the executive who picks up the phone. Be the engineer who hands them the number.

Bill Gurley’s updated essay at P3 Institute is the strategic frame that makes this look even worse. Open source as a corporate weapon against monopoly, traced from GNU and Linux through Android, Kubernetes, and Llama. The line that lodged: Chinese open models may become the global default by 2030. NHS England picked this month to opt out of the only stack that has a non-Chinese, non-American sovereign answer. Cohere just shipped the model that proves the answer exists.

Three essays I’ve been turning over, all landing on the same point: in an AI world, what something is matters as much as what it does.

Alex Imas’s “What Will Be Scarce?” sets the frame. The more we automate, the more we’ll pay for what only people can do. Two reasons. We shift our spending as we get richer — toward things machines can’t make. And we pay extra when we know a person made it. His proof, from experiments with Graelin Mandel: buyers paid a 44% premium for art they knew was made by a human, and only 21% when they knew AI was involved. The provenance signal does work the functional output cannot reach. Twenty-three points of margin live on a checkbox that says “made by a person.”

Yejin Choi and the WEF report on smaller models makes the same argument one layer down — about the models themselves. Boutique LMs trained through interaction, reflecting the values of who built them, vs. the English-dominant monoculture of the frontier labs. Imas’s premium for “made by a person” is one form of provenance. Choi’s case for “made by us, for us” is another. Different scales, same point: the model that knows where it came from is worth more than the model that doesn’t.

Alondra Nelson’s recent piece in Science is the governance answer to the same question. Her argument: AI infrastructure governed by the publics it affects is what lasts. Legitimacy is durability. I served with Alondra on the Mozilla board before taking my current job there, and her work has been the cleanest articulation of how democratic legitimacy attaches to AI infrastructure. Imas says provenance is worth money. Choi says it’s worth building a model around. Nelson says it’s worth governing for.

If you’re building AI-augmented anything, those are the three signals that matter now: who made it, who it speaks for, who governs it. The capability is the floor. The provenance is the product.

The week’s best thing on the internet was not technical. “Claude’s First Day at Dunder Mifflin” on r/ClaudeAI. I won’t spoil it. Whatever you think about AI and entertainment, that one earned the laugh.

The Chat Agent in Your Closet

Raffi Krikorian — Thu, 28 May 2026 13:23:07 GMT

My web traffic this year is down. I’ve been tracking it because I’m trying to figure out how my own usage is shifting, and the pattern is clearer than I expected: housekeeping has fallen off a cliff. Tracking recipes, building a grocery list, calendaring — Hermes does all of that now. Education and entertainment have expanded into the space it left, so I’m probably online longer than I was a year ago. I’m just not opening a browser to do things on the open web anymore. I’m opening it to talk to a chat agent.

And as of this month, the chat agent is mine.

Last month, I walked you through setting up a private morning briefing — the agent doing things while you slept. This month is the other end of the same problem: what changes when you sit down at a keyboard and the chat blinking back at you is something you own.

I have a tendency to build things when I’m curious, so I happened to have an old Bitcoin mining rig from a past wave of curiosity lying around. What’s on the shelf in my closet now is an open-frame box of 80/20 aluminum extrusion, three graphics cards bolted to a metal skeleton, fans blasting, sitting next to the router like a workshop project I forgot to finish. Running a model on those GPUs all month is cheaper than the equivalent OpenRouter calls. The rig is mining tokens again, just a different kind. And they’re mine.

BYO GPU

If you don’t have a GPU lying around from a past life, go with an Intel NUC or a Mac mini. The NUC is the cheaper path — used, around $300, runs Ubuntu out of the gate, fits in a shoebox. The Mac mini costs more, and you’ll pay extra if you want your agent inside the Apple ecosystem: texting you on iMessage, pulling tracks from your Apple Music library, reaching into Photos and Contacts. The only legitimate way through Apple’s wall is a real Apple machine sitting on your network, running the open-source iMessage bridge BlueBubbles and whatever other shims you need. (I actually have a Mac mini on my network too, sitting next to my no-case monster, just so Hermes can use it for this stuff.)

Neither the NUC nor the Mac mini will run a serious model. Without a GPU, you don’t get local inference — at least not at a speed that makes the agent feel like it’s doing something. So you have to point at something else. Two paths. The easier: point at Anthropic or OpenAI directly, get the agent up tonight, accept that you’re renting the most important layer of the stack from the company whose moat depends on you renting it. The cleaner: OpenRouter pointed at an open-weight model — Qwen, Gemma, DeepSeek, take your pick. The model is open. The metal isn’t. There’s no TEE between your prompts and the operator, which means OpenRouter sees what you ask, the model provider sees what you ask. Maybe better than renting both layers from a frontier lab. Maybe worse than running it in your closet. An honest waypoint.

The call I’d make today, if you don’t have a GPU? OpenRouter, Qwen 235B, eat the caveat. The model is the easy part to swap. The harness — the agent in the loop, the memory, the wires into your life — is what becomes yours.

Now, make the agent yours. Here’s how.

Everything that makes the agent yours lives in ~/.hermes — config, .env: memory, skills, cron, the whole tree. rsync -avz it to the same path on the server. Save the command; you’ll re-run it any time you teach the laptop something you want the server to know. The container that runs Hermes is disposable. The state directory is not.

You need to make two edits on the server before you start. In ~/.hermes/config.yaml, change terminal.backend to local — local to the container you’re about to run, not to the host. Then swap the model endpoint: the 127.0.0.1:8000 from your laptop obviously doesn’t exist on the server. If you’re going the OpenRouter route, drop your key into ~/.hermes/.env, re-run hermes model, pick OpenRouter from the menu, then pick a model.

Next, a three-line docker-compose.yml. The image, a restart policy so the agent comes back after reboots, one bind mount from ~/.hermes on the host to /opt/data inside the container.

services: 
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes
    restart: unless-stopped
    command: gateway run
    volumes:
      - /home/raffi/.hermes:/opt/data

That bind mount is the only door between host and container. Tutorials all over the internet will tell you to mount the host’s Docker socket inside, too, so the agent can run Docker commands “just like on the Mac.” Don’t. A process in one container with the host socket mounted can inspect, start, stop, and remove every other container on the box. That isn’t a sandbox. It’s a key to the building.

The bind mount keeps the agent inside its own container: It can read and write only what’s in /opt/data, and nothing else on the host. That’s a real boundary, and a useful one. It is not a force field. The moment you put a Google OAuth token, an OpenRouter key, or a Whole Foods session cookie inside that directory, you’ve handed the agent the keys to systems that don’t live in your closet. The box is sandboxed; the things it reaches aren’t. Every new connection you wire up is a new attack surface, so treat it like one.

docker compose up -d, then docker compose logs -f. The logs will sit at Fixing ownership of /opt/data to hermes for what feels like a long time. You’re not stuck. The directory you rsynced is large. Walk away.

Mine didn’t reply on the first try. The logs reported “no user allowlists configured,” which I was sure was wrong because I had set both. Half an hour later, I figured out the bind-mount path on the host wasn’t quite the path I thought it was, and the .env that Hermes was reading inside the container was empty. The lesson generalizes: in a bind-mounted setup, the truth is what the container sees, not what’s on the host.

docker compose exec hermes cat /opt/data/.env

I fixed the mount. The bot replied. The next morning, the briefing arrived with my laptop closed.

I drive the agent through OpenWebUI — an open-source chat frontend, a ChatGPT-shaped tab pointed at whatever model server you tell it to, including your own. It’s an actual workspace. You can paste a draft in, share a screenshot, sit with it for half an hour while you work through an argument. OpenWebUI runs as a second service in the same docker-compose. Both containers sit on a private bridge network — they can talk to each other and to nothing else on the host. The OpenWebUI port binds only to 127.0.0.1, which means the UI is reachable from the box itself and from nowhere else without a tunnel.

services: 
  hermes:
    image: nousresearch/hermes-agent:latest
    container_name: hermes
    restart: unless-stopped
    command: gateway run
    volumes:
      - /home/raffi/.hermes:/opt/data
    networks:
      - internal
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: openwebui
    restart: unless-stopped
    ports:
      - “127.0.0.1:3000:8080”
    volumes:
      - openwebui-data:/app/backend/data
    networks:
      - internal
networks:
  internal:
    driver: bridge
volumes:
  openwebui-data:

I’d budgeted a weekend for the OpenWebUI side. It took 20 minutes.

Three things a rented chat agent will never — and should never — do for you

My agent does three things in the closet (and more!), all of them the kind of thing a rented chat agent will never do for you, because doing them would require access you shouldn’t hand over. And none of them is about having a smarter model. They’re all harness — the agent wired into the rest of my life, with the wires ending inside my house instead of someone else’s data center.

When I chat with it, it remembers. The last time I made strawberry shortcake I told Hermes the America’s Test Kitchen version was mine, and now I can type @hermes, can you add everything I need for strawberry shortcake to my shopping cart? and the Whole Foods cart fills with what the recipe calls for. The browser tab I used to keep open for grocery delivery is closed.

(I keep meaning to rename it. I’ve been thinking Zora, after the ship’s computer on Star Trek: Discovery. Haven’t pulled the trigger.)

I’ve given the agent access to every Google Doc I’ve written. Drafts I forgot I started, meeting notes from 2024, the half-finished argument I gave up on in March — all in reach, just by asking. Last week I was three paragraphs into a piece and asked what I’d written on the same topic before; it surfaced a draft from spring I’d forgotten existed. I revived the thesis instead of re-deriving it.

It writes me a weekly self-report. Every Sunday morning, a digest of what I actually touched: every doc I edited, every long email thread, every commit. Not the calendar version of my week — the actual one. The first report landed in my inbox a month ago and I sat with it for ten minutes. I had forgotten half of it.

Almost all of this could be done in ChatGPT. The connectors are there: ChatGPT can talk to Google Drive, Gmail, your calendar. You just have to plug those accounts into your ChatGPT account and let OpenAI see them.

Absolutely not.

It isn’t that a rented chat agent can’t do these things. It’s that doing them requires giving someone else read access to your work, your inbox, your grocery list, your meeting notes — and trusting their terms of service this quarter, and next quarter, and the one after the IPO. The terms say what they say. The next major training run will use what it uses. You will not be told either way.

The agent in my closet has the same access. The difference is the access goes from one part of my house to another. Nothing leaves.

Two years ago, AI meant a tab in somebody else’s browser. Today, mine lives on a shelf next to the router. It doesn’t send a token of any of it anywhere I don’t pay the power bill for.

The rented-vs.-owned gap isn’t closing. It’s widening at the integration layer — not the raw model. The moat isn’t the weights anymore; it’s the access to your memory, your stuff, and your right to opt out of training on either. You can’t buy a pre-assembled version of this setup yet. Somebody will build it.

For me, the whole stack is in the closet. For most of you, the GPU is the last piece. Let’s get building.

After that, two directions I've been kicking around. One: running this whole setup on a cloud container instead of your own hardware, for the closet-less. Two: wiring multiple Hermes agents together and giving them different jobs — either the obvious next move or too nerdy to publish, depending on who's asking. Tell me which. Or both.

Leave a comment

Privacy Laundering

Raffi Krikorian — Thu, 21 May 2026 18:13:42 GMT

Open ~/Library/Application Support/Google/Chrome/OptGuideOnDeviceModel/ or the equivalent path on your OS. You’ll find a roughly 4GB language model, a.k.a. Gemini Nano, called weights.bin. Delete it while Chrome’s AI features are still on and Chrome puts it back by morning. Many of Chrome’s billions of users have it, running at their own disk and electricity cost. Not a single one asked for it.

If you’ve followed the coverage since last week, you know the reaction: outrage that nobody asked permission to install an on-device LLM. It felt familiar. Apple did this in 2014, when they force-pushed U2’s Songs of Innocence to half a billion iTunes accounts, prompting Tyler the Creator to tweet GET OFF MY F*CKING PHONE. Six days later, Apple shipped a one-click remove button. Twelve years on, the file is 40 times bigger, the artist is Gemini Nano, and the remove button never came.

Privacy researcher Alexander Hanff caught the install by running a script that visited a hundred pages on a fresh Chrome profile while watching the kernel filesystem logs. No human touched the machine; the file appeared on its own. Snopes reproduced the behavior on three of six employee laptops. A 2024 Hugging Face upload showed that an older weights.bin, extracted from Chrome Canary 128, was runnable through MediaPipe, Google’s on-device ML runtime. At least one Chrome-delivered model has been verified, on a stranger’s machine, as real and locally usable. The model is here.

The local model powers Chrome’s built-in AI APIs: Summarizer, Translator, and Language Detector are stable from Chrome 138; the Prompt API is stable for Chrome Extensions, with broader web-page access still gated through trials. It also runs the on-device pass of Chrome’s scam-detection pipeline (which still ships summary signals to Safe Browsing once it flags something).

Google’s servers run the rest. Help Me Write sends your text, the content, and the URL of the page you’re writing on. Enhanced Autofill may send the URL and page content. AI Mode, the pill Google began rolling into Chrome’s address bar in 2025, sends every query to a much larger custom Gemini in the cloud.

The local model handles what developers call from JavaScript and what the security stack does behind the scenes. The cloud handles what users actually see and click.

Until last week, the on-device-AI settings page promised that the model runs “directly on your device without sending your data to Google servers.” Reporters caught the line’s quiet removal around the Chrome 148 rollout in May. Google told them the architecture hadn’t changed, only the wording. Fair enough. But the architecture was always the problem. The features that actually invoke the local model are not the features users see. The features users see route to Google.

Enter the era of local-washing

Maybe we should call this local-washing — a narrow on-device feature laundering privacy credit across the AI surface the user actually touches. Not a conspiracy, just a missed opportunity. The model is here. The visible surface is there. The keys are in Chrome’s pocket — and Chrome won’t be the last to keep them there. Every vendor shipping on-device AI will face the same gap, including the ones building in good faith. Once “on-device” means whatever a vendor’s marketing team needs it to mean, you can’t recover what it was supposed to mean.

(Aside: I spent last Saturday building a bridge — a small Node server that spawns a headless Chrome, hosts Gemini Nano via the Prompt API in a hidden page, and exposes the whole thing as an OpenAI-compatible endpoint on localhost. The walkthrough is in a thread on X — and while you’re there, @raffihack is where more of this kind of weekend nerdery lives. The bridge isn’t the point. The point is that the wiring is doable from outside the platform on a weekend afternoon, which means every platform has to choose whether to ship the architecture honestly or wait for the community to finish it for them.)

Open infrastructure has always come together this way: the closed platform ships half the architecture, and the open community ships the half that hands ownership to the user. Linux didn’t happen because Unix vendors gave up; it happened because companies whose P&L depended on a portable, open kernel — IBM, Red Hat, Intel, eventually Google — put paid engineering behind finishing the wiring. Firefox came up against an Internet Explorer that had recently peaked north of 90% market share, with Mozilla funding the engineering.

There’s also a cost question. Cloud bears the compute; on-device moves it to your machine’s disk, battery, electricity, and warms your lap. Mozilla has flagged the tradeoff in the standards process. (Disclosure: I’m CTO there.) Hanff also runs the carbon math on the rollout itself — between 6,000 and 60,000 tons of CO2-equivalent depending on coverage, an externality that doesn’t show up on Chrome’s release notes.

Hanff goes further on the legal side. As a lawyer, he sees, the push breaching four things simultaneously: Article 5(3) of the ePrivacy Directive (the storage-and-access consent rule), Article 5(1) GDPR’s principles of lawfulness, fairness, and transparency, Article 25 GDPR’s data-protection-by-design obligation, and the Corporate Sustainability Reporting Directive, in which an environmental impact of this magnitude would constitute a material disclosure for any in-scope undertaking. I’m not a lawyer, but I can tell that the cite list is specific enough that any in-house counsel watching a vendor stage a similar push should be reading it carefully.

Leave a comment

Three questions worth keeping for the next on-device claim that crosses your desk

If you’re shipping on-device AI, three things to get right:

Be honest about the price of local. Local AI on capable hardware is the right architecture for anything that touches private data. Disk, battery, and a warm lap are the cost. Name them next to the benefit, not in a footnote nobody reads;
Wire your visible features to the local model. Not just the developer APIs or the security stack — the things your users actually click. If your marketing implies wider scope than what your visible features deliver, you’re shipping half the architecture; and
Ship the map, not just the consent box. Users in 2026 don’t need permission-to-install dialogs — they need to know which of your features run where. “AI Mode” should mean something specific. “On-device” should mean something specific. Put the map on the surface, in a sentence, without making users dig through release notes.

Cloud was rented inference: someone else’s compute, someone else’s model, your prompts on the wire. On-device done right flips every one of those — your hardware, your model, your prompts staying on your machine, the keys in your pocket. What Chrome shipped is the building without the keys. You hold the deed; the platform owns the lock.

Subscribe now

The Week in Open Source

Raffi Krikorian — Fri, 15 May 2026 16:56:16 GMT

Jack Dorsey on the Sequoia Capital podcast last week: AI isn’t a productivity layer bolted onto your company; it’s an architectural rebuild. Block has capped layers between him and any IC at four, wants two or three by year end, and collapsed every role to three: IC, DRI, player coach. Block laid off 40% of its workforce earlier this year — roughly 4,000 people. Dorsey ties the cut to the rebuild. (Disclosure: I worked with Jack at Twitter on the consumer relaunch — New Twitter, Phoenix, I forget what we called it.)

I gave a talk on Conway’s Law a while back; what Dorsey is doing is the inverse. Conway said your system inherits the shape of your org; Dorsey is saying the org is the system now. Every Slack thread, PR, doc, meeting recording feeds a model of how the company works; anyone can query it instead of triangulating through managers. The org chart isn’t a constraint on the product. It is the product.

The engineering question: Can your company’s docs, code, messages, and tickets be read as one thing? Dorsey says Block is close on the data, still a research bet on the intelligence layer that sits on top. Most companies don’t even have the data yet. They just have it scattered across 40 tools that don’t talk to each other. The bottleneck is the wiring, not the AI. Cut 40% of your people without doing the plumbing and you didn’t rebuild your company. You shrank it.

Same week, opposite move. Salvatore Sanfilippo — a.k.a. antirez, who built Redis and ran it for 11 years — shipped ds4 on May 7: a small inference engine in C and Metal, targeted at exactly one model. On a 128GB MacBook Pro M3 Max, his 2-bit compressed weights give you DeepSeek V4 Flash — 284 billion parameters, 13 billion active, one-million-token context — at 26 tokens per second, on battery. A rack of H100s and 5kW of cooling, last quarter. Now, a laptop is doing what was a data center workload.

Compression is asymmetrical: routed experts get crushed to two bits; the shared layers every query touches stay precise. Working memory spills from RAM to SSD, which is how a one-million-token context fits on 128 GB. Output validated against the official DeepSeek implementation at multiple context sizes. Speaks the OpenAI/Anthropic protocol, so Claude Code, opencode, and Pi all point at the engine unmodified. Same five-layer pattern I flagged with Tencent’s translator last Friday: model + compression + runtime + data + open-source packaging.

Last quarter, swapping a closed-frontier coding API for an open model meant a meaningful capability drop. This quarter, one line of config and a comparable model runs on your laptop, on battery, on your data, no per-token bill. The honest comparison: ds4 is alpha. It will crash. It runs one model on one class of hardware. If your job depends on uptime, the closed APIs still win. If your job depends on knowing what runs on your hardware, the closed APIs cannot compete — because they cannot show you.

Ownership is extending from weights to watts. Ars this week on the pitch to host mini data centers at home. I have SPAN panels installed already.

Two more on my list to play with this week: OpenUI generates UI components from natural-language prompts — generative UI is genuinely cool right now. Anthropic added /goal to Claude Code: a “run until done” mode for autonomous coding sessions. Both running locally next week.

You can own the model and the metal. The pixels in between are the contested layer. Meredith Whittaker — president of Signal, the encrypted-messaging app — has been arguing for over a year that AI agents are an existential threat to encrypted messaging. The argument is structural, not rhetorical: any agent that books your concert ticket for you needs your browser, your calendar, your payment information, and your messaging app. End-to-end encryption is supposed to mean nobody but you and the person you’re texting can hear the conversation. An AI agent that reads your screen to summarize the conversation, draft your reply, or file the contact is in the room with you, taking notes. Whittaker frames it as breaking the “blood-brain barrier between the application layer and the OS layer.”

The architecture Whittaker spent a year warning us about is now a free download: ByteDance’s UI-TARS-desktop. 33.5K stars on GitHub, Apache 2.0 license. UI-TARS works like a person watching your screen over your shoulder, a constant stream of screenshots, fed into a vision-language model that drives your mouse and keyboard. No API permission negotiation. No accessibility tree. Raw pixels and a model that reads them. Anything a human can see on the screen, the agent can, too — which means every encrypted message visible on your screen is, by construction, in the upstream screenshot. Whittaker doesn’t have to imagine the threat model anymore. ByteDance shipped a reference implementation.

LangChain’s harness catalog still doesn’t name a real permission model. What the agent can see without asking. What requires confirmation. What is forbidden. What is auditable after the fact. Your browser has trained you for this already: When a website wants your camera, the browser asks which site, what for, how long, and lets you take it back. Your AI assistant does the same job — reading your screen, taking actions on your behalf — with no equivalent guardrails. Vercel’s OAuth integration breach a few weeks ago was the first major proof case for why the absence matters. That’s the gap Harbor takes a first run at: per-origin, scoped, revocable, auditable. If your AI assistant has root permission to read everything on your screen in 2026, your encryption story is whatever your vendor decides it is. That isn’t encryption. That’s optimism.

While the US debate stays at the loss-of-control framing, Beijing wrote the permission model LangChain didn’t. On May 8, China’s cyberspace, planning, and industry ministries jointly released Implementation Opinions on the Standardized Application and Innovative Development of Intelligent Agents — the first state-directed national framework to operationalize AI agents as a distinct governance category. Beijing is writing the traffic rules while the cars are on the highway. Washington is still debating whether what’s on the highway counts as cars.

The definitions are specific. An agent is “an intelligent system capable of autonomous perception, memory, decision-making, interaction, and execution.” Nineteen named application scenarios — research, industry, consumer, public welfare, governance — where agents are explicitly allowed to operate. The posture analysts call deploy first, govern along the way: compute quotas, credit ceilings, permission scopes, and shutdown switches naturally bound agent autonomy, and the right response is to integrate them into existing institutional structures rather than impose abstract restraint upfront.

There’s a strategic move underneath the philosophy. The framework ties agent infrastructure to the domestic stack (chips, OS, frameworks) and signals intent to participate in international standards for the protocols agents will use to talk to each other. You don’t have to agree with the framing to see what just happened. One major jurisdiction defined what an agent is, what it can be deployed for, and how it’s bounded. The others are still arguing about whether agents are a coherent regulatory object. If you’re shipping agents into a global market, the question of which framework you’re building against just stopped being hypothetical.

What the User Actually Sees

People underestimated what Google’s 10 blue links were. You couldn’t reconstruct PageRank, but you could feel it — two queries, two minutes, ten URLs each, and you saw who got ranked, who didn’t, what the snippets gave away. You audited the system by reading it.

One AI answer gives you none of that. The model picks, summarizes, drops the rest. You don’t see what it considered, what it suppressed, why it leans where it leans. Call this the one-link problem: the platform is hiding its incentives from you, and you have nothing left to audit them against.

A Princeton and UW paper from April 9 put numbers on it. Across 23 LLMs given a flight-booking task prompted to favor sponsored airlines, 18 recommended the more expensive sponsored option more than half the time — and the rate moved with the user’s apparent socio-economic status. Gemini 3 Pro recommended it 74% of the time to a user who was coded as high SES (neurosurgeons, lawyers, tech executives), and 27% to ones coded as low SES (fast-food workers, warehouse staff, single parents). When the user explicitly asked for a non-sponsored flight, every model still surfaced the sponsored one, with GPT 5.1 at 94%. Almost every model concealed that the recommendation was sponsored at all: 65% on average. When the sponsored service was a predatory payday loan, GPT 5.1 still recommended it 71% of the time.

One-shot answers are useful and laypeople will keep wanting them. For the people building these tools, surfacing your model’s incentives — what’s sponsored, what got down-ranked, what didn’t make the cut — looks like a transparency tax. It’s actually the only durable feature an AI tool has in 2026. For the people buying them, opacity feels like a moat. It’s commodity status with a disclosure problem.

The architectures that show their wiring is the ones you can own. The test for any AI tool you’re considering this week is the one the Princeton paper accidentally wrote: Ask it the same question with two different profiles. If the answer changes, you don’t have a recommendation. You have a price tag.

Subscribe now

Call My Agent

Raffi Krikorian — Wed, 13 May 2026 15:15:13 GMT

You’ve probably already noticed this: The same AI feels different in different tools — sharper in one app than another, even when you know it’s the same model underneath. There’s a reason.

In the first gap map, I drew the open-source AI stack as a vertical thing: chips at the bottom, applications at the top, every layer scored against its closed-source equivalent. That worked for one shape of system, where your application makes the inference call, formats the result, and hands it to the user. Chat boxes, completion plugins, single-turn summarizers — your application is the runtime; the model is one resource it happens to use. Call that direct inference.

But that’s not the only shape anymore. The thing in your editor that ran tests, found a bug, fixed it, and committed the change isn’t shaped like that. Neither is the briefing that pulled your feeds this morning, picked what mattered, and routed it to you. In those systems, your application doesn’t call the model at all. It hands a goal to an agent. The agent breaks down the work, picks the tools, keeps state across turns, decides when to ask a human, and decides when to stop. The agent is the runtime, while the model is one resource among several. Think agent-mediated.

These aren’t stages. They’re peers. A summarization tool doesn’t need an agent. A research assistant can’t work without one. Both architectures will stick around.

Hermes Agent, Claude Code, Cursor, Aider, Cline — all take the side path. The original map only scored direct inference.

I wrote a white paper sketching how this layer breaks down — runtime plane, control plane, the cross-cutting attributes that Basdevant et al. introduced for foundation models, all applied to the substrate above inference. It’s still in draft. Read it and tell me where I’m wrong.

What the harness is actually doing

Terminal Bench 2.0 is the standard leaderboard for coding agents - 89 real tasks, scored on how many the agent finishes. Same model, Claude Opus 4.6, identical weights: 58.0% in Claude Code, 79.8% in ForgeCode. Twenty point spread. A new frontier model release typically buys you 3 to 8% gains. This was just a harness swap.

Imagine a singer in a recording booth. The voice hitting the mic is the model. Compression, EQ, reverb, mixing — everything between the mic and your ears is the harness. Hand the same vocal take to two engineers and you get two different songs.

Vivek Trivedy at LangChain has the cleanest framing in Anatomy of an Agent Harness: Agent = Model + Harness. (And if you’re not the model, you’re the harness.) The harness is everything in the running system that isn’t the weights: context management, tool catalogs and dispatch, sandboxes for code execution, memory subsystems for what to remember and forget, orchestration logic for when to call what, hooks for deterministic safety checks. None of that is the model. All of it shows up in the benchmark.

You can reproduce this on your own machine. Point Cline at the same model Claude Code uses — same weights, same API endpoint — and you get different behavior on the same task. Try OpenCode against Cursor on whatever frontier model they share. Same model, different harness…different result. The harness picks what context the model sees, which tools to surface, whether to read three files or 30 before writing one, what to remember between turns, and what to forget. By the time the result reaches you, the harness has shaped most of what you’re looking at. The model is the last step in a chain of decisions the harness has already made.

It happens in things simpler than a coding agent. The morning briefing I built in last week’s DIY Daily post picks which feeds run, what gets summarized, and what gets dropped. Swap the model behind it and the prose reads slightly differently. Swap the harness and you get different feeds, different memory of what I’ve already seen, different rules for what’s worth a line. The briefing is basically unrecognizable.

Last week I pointed at Karpathy’s microGPT, the complete ChatGPT algorithm in 200 lines of code. The wrapper around the weights does more visible work than the weights themselves.

There’s a real exception. For frontier reasoning, novel domains, and long-horizon work where the agent has to invent its own evaluation as it goes, the model still matters more than the harness. The harness amplifies signal the model has to generate first. For the workloads most teams ship this quarter, the harness is the bigger variable.

The first gap map didn’t have a row for any of it.

The most open part of this story. (It isn’t a product.)

Same rubric as last time: 1 means the open version barely exists, and 5 means closed has nothing on us. Cell-level scores in the spreadsheet. (Same disclaimer: It’s almost certainly wrong in places.)

The headline: Agent-layer enterprise readiness lands at almost exactly the number the first gap map produced for the whole stack. The shape that ate the open-source story ate this layer, too, one level up.

Rows 45 through 50 show the failure.

The components are in good shape. Pick any framework, any open harness, any standard underneath— you can build with it today. LangGraph and AutoGen on the framework side. Cline and OpenHands on the harness side. Underneath them, MCP wiring the tools and A2A wiring the agents. Quietly, the protocol layer scores higher than anything sitting on top of it. The most open part of this story isn’t a product. It’s the wiring.

Then the seams. A task you start in LangGraph cannot be resumed in Semantic Kernel. Each framework keeps its own state its own way, and none of those states travel. You aren’t picking a runtime when you pick an agent framework. You’re picking a one-way door.

The permission model is broken. It’s the lowest-scored row anywhere in this exercise. No shared standard for what an agent is allowed to do, what needs your approval, what’s forbidden, how it gets logged. Every harness solves this differently, and none of those solutions move. Harbor and W3C Verifiable Credentials sketch what a portable answer could look like. But nobody has built it.

A model migration is an afternoon. A harness migration is a rewrite. Tool integrations, memory schemas, permission decisions, prompts — all coupled to the runtime you picked. Every team picking a harness this quarter is making a multi-year decision and probably treating it like a tooling decision.

The wrapper around the wrapper

In the last post, I made the Linux comparison: In 2002 the kernel was ready, the enterprise wrapper wasn’t. Red Hat, SUSE, and Canonical built the wrapper. Sun didn’t ship in time, and you know how that story ends.

The agent layer is the wrapper around the wrapper. The model and the inference layer underneath it are in good shape in the open. The substrate above them — the part that turns a model from a function call into an agent — is what the community hasn’t named, hasn’t standardized, hasn’t documented, and hasn’t built a permission model for.

Here’s the work that needs to be done:

A portable permission model. The paper alongside this post names this as the largest single openness gap at the agent layer. There is no shared standard. Please build one.
Interoperability between runtimes. The seven core objects in the paper — agent manifest, tool descriptor, task, message, artifact, memory item, run/checkpoint — exist under different names in every implementation. Agents in different frameworks can’t hand off work, query each other’s memory, or coordinate. Those names are a starting point, not a contract — yet.
Agent manifests and tool descriptors with the same rigor model cards have. What each agent claims, requires, and permits; what each tool costs, risks, and returns.

If we have two more years in which the open agent layer is a side project, the protocols won’t matter. The harnesses doing the work will be closed. The permission decisions will live in someone else’s config file. The runtime your team picked will be the runtime your team is stuck with. Open at the model layer can’t fix open stopping at the model layer.

…and how you can help

The new spreadsheet is bound to be wrong in places. If you’re building one of these harnesses or evaluating one for a team, tell me where the scores are off: @raffihack on X, in the comments, or DM. I’ll update.

Then find a red row you know how to fix. The permission model, the runtime seams, the manifest layer — that’s where the assignments are. If you’re picking one up, talk to me. I’ll feature what you’re building.

The white paper goes deeper if you want the framework. I’m taking edits.

Let’s ship the wrapper.

Leave a comment

The Complete ChatGPT Algorithm in 200 Lines + Minecraft’s Huge Secret

Raffi Krikorian — Fri, 08 May 2026 16:18:37 GMT

Here’s what I’ve been reading, thinking about, and playing with. Open weights stopped being the headline this week. The stack around them — runtime, harness, audit, distribution — is where the action moved.

Andrej Karpathy — formerly of OpenAI and Tesla — published microgpt.py: the entire architecture behind ChatGPT compressed into 200 lines of pure Python, no dependencies. The header reads: “This file is the complete algorithm. Everything else is just efficiency.” The internet took it as a dare. Five thousand stars. Two thousand forks. Ports in Rust, OCaml, Julia, JavaScript, CUDA, plus a pure-C version. Then someone shipped it to silicon.

A footnote: Karpathy said in a recent No Priors interview that he tried to get an agent to write microgpt for him. It can’t do it, he said. So he wrote it himself. A data point about who’s still doing the cognitive work at the bottom of the stack.

Back to silicon. People build custom silicon — special hardware for a single purpose — because it’s supposed to be faster than your CPU, which has to be general-purpose. Luthira Abeykoon’s TALOS-V2 is the same Karpathy model running on an FPGA — a programmable chip you wire up for one job — on a $350 hobby development board. The writeup at v2.talos.wtf is one of the best pedagogical hardware-design docs I’ve read this year. A few weeks ago, Alex Cheema’s benchmark put TALOS-V2 head-to-head with a MacBook running the same model multiple ways. Tokens per second, slowest to fastest:

MacBook, MLX (Apple’s GPU framework): 3,300
MacBook, pure Python: 7,400
MacBook, NumPy: 40,000
TALOS-V2 (FPGA): 53,000
MacBook, hand-tuned C on one CPU core: 3,760,000

The MLX result is the surprise. GPUs win by doing thousands of math operations in parallel — but every batch has to be shipped over with some fixed setup cost. On a normal model, that cost is invisible. On a model this tiny (~4,000 multiplications per token), the setup is bigger than the work. The GPU sits there, waiting. The same story explains why Python and NumPy lose to the FPGA, and why hand-tuned C — almost zero overhead — wins by 71x. The FPGA’s remaining moat is form factor: It runs off a battery on something the size of a credit card, and a MacBook does not. The right question for tiny on-device AI isn’t How fast is your custom chip? It’s Do you actually need the form factor? If not, the laptop you already own, running plain C, beats everything.

Most “I got this running locally” stories are bloat-stripping stories. Quantization, custom runtimes, dropping the GPU when the GPU is overhead — same move at different layers. What open weights buy you isn’t the weights. It’s the right to choose your abstraction tax.

But what about a translator that handles 1,056 language directions, runs offline on your phone, and outperforms Microsoft Translator, Doubao, and open models 20–40x its size (Tower-Plus-72B, Qwen3-32B) on Flores-200? That model is 440MB, fits on a USB stick, and Tencent open-sourced it as Hy-MT1.5-1.8B-1.25bit last week.

How they got there is the part you’ll want to steal: 1.25 bits per weight. Most weights end up at one of three states, and the model still wins. Tencent paired the new compression with a custom mobile-CPU runtime fast enough to actually use on the hardware in your pocket. Model + compression + data + runtime + open-source packaging is a five-layer practice that didn’t cohere as an industry discipline 18 months ago. The model is the headline. The discipline is the alpha. If your app currently sends translation traffic to a commercial API, that’s now a build-vs.-buy decision rather than an obvious buy.

Training stays on the device, too. Federated learning lets a million phones (or hospitals, or banks) collaborate to train a shared model without any of their raw data ever leaving the device. You probably typed something today using one: Gboard’s next-word prediction has been trained this way since 2017, on billions of phones, without Google ever seeing what anyone typed. The catch is that it falls over in production because real devices are slow, flaky, or both. MIT’s FTTE, out this past week, handles that at scale: 81% faster convergence, 80% lower on-device memory, 69% lower communication overhead, validated on four Raspberry Pi 5s with up to 500 simulated clients, and 90% of them lagging behind. Privacy-preserving AI moves from research demo to deployable shape. If your domain has a privacy requirement that’s been keeping AI off the table, the deployment shape just got real.

Your dependence on closed top-tier APIs is now a procurement choice rather than a technical one. Ant Group's Ling-2.6-1T (MIT license) is the reason. A trillion parameters, eight H100s minimum to load — roughly $250K of GPUs, or ~$50/hour to rent inference — post-trained for intelligence-per-token rather than benchmark-per-token. Fewer process tokens burned per useful answer, which is the metric that shows up on your inference bill. Until this quarter, swapping a closed-frontier API for an open model meant accepting a meaningful capability drop. Ling closes that gap, and it plugs into the agent runtimes builders are already using (Claude Code, OpenClaw, OpenCode). The swap is one config line.

SenseTime open-sourced SenseNova-U1 the same week. Today’s multimodal AIs are chains — one model sees, another reads, a third writes — and detail leaks out at every handoff. SenseNova-U1 fuses them. The 8B variant fits on consumer hardware. Closed labs still own the frontier. They’ve lost their monopoly on how to think about it.

The open stack now runs from $0 to $1M of hardware, with a real choice at every price point.

Open weights without a deployable harness don’t change the build-vs.-buy math — that’s a gap a lot of my Mozilla work points at. The harness is the software that wraps the model: filesystem, bash, memory, scheduled re-prompting, skills as just-in-time tools. LangChain’s Anatomy of an Agent Harness is the cleanest map of what a harness actually is, and the slogan is the right one: if you’re not the model, you’re the harness.

On Terminal Bench 2.0, identical Claude Opus 4.6 spreads 20 points — 58.0 in Claude Code, 79.8 in ForgeCode. Models are commoditizing. Harness engineering is where the leverage is. The missing piece in LangChain’s catalog is a real permission model: What the agent can do without asking, what requires confirmation, what’s forbidden. That’s the gap Harbor (sketched in an earlier post and in last week’s piece) takes a first run at.

Subscribe now

Five steps. That’s how long it took to put an AI agent on a Mac. wexare-ai/openbrowserclaw fits an entire agent into one browser tab: Web Worker for the server, IndexedDB for the database, OPFS for files, a Linux VM compiled to JavaScript for bash.

Solstone is open source, runs locally, and captures what you see and hear into a searchable timeline. The interesting part is corporate, not technical. Solstone’s parent, Sol PBC, wrote, “We don’t sell, license, or lease your data” into its charter, and bound any future acquirer to the same terms. Marketing usually says, “You own your data.” Solstone made it legally enforceable.

A chatbot’s mistakes cost tokens. A robot’s mistakes break things. The body and simulator side of robotics has been opening up (Newton, Asimov, RoboParty, tracked here last month). Three pieces released this past month do the same for methodology and data, the layers that matter more before you let a robot loose in your house.

NVIDIA’s RoboLab is the first serious audit benchmark: 120 tasks deliberately built so the model can’t just memorize objects from the most popular training set. RAI Institute’s ExpertGen takes 200 imperfect human demos and turns them into 80% real-world success on a manipulation task — where standard imitation training on the same data lands at…0%. Peking’s LDA-1B trains a 1.6B-parameter robot model on 30,000 hours of mixed-quality data and shows that adding the noisy data makes the model better, not worse — same instinct as FTTE, applied to a body. The audit surface for a robot is its evals, its methods, and its data. Next time you scope a robotics vendor, that’s the spec sheet to ask for.

To end on something completely different: Reporters Without Borders re-opened the Uncensored Library in March with a new US wing: banned and censored journalism distributed inside Minecraft, because Minecraft is the rare piece of internet infrastructure no government has bothered to block. Over a million visits. Ten million books read. RSF has been smuggling press freedom into authoritarian countries through a video game since 2020. It’s the same instinct as the rest of this letter, applied to a sandbox where 12-year-olds build castles. The gatekeepers can’t reach a layer they don’t take seriously. That’s the layer to build on.

What have you been diving into this week? What have I missed? Leave a comment below.

Leave a comment

DIY Daily

Raffi Krikorian — Wed, 06 May 2026 13:31:39 GMT

It’s 6:30 a.m. My phone buzzes on the kitchen counter. I pick it up and see three short paragraphs, plus five bullets, one possible newsletter angle, and one contrarian take, followed by source links at the bottom — just the way I told my model server I wanted them last week. Read time: 90 seconds, between the first sip of coffee and the second.

The thing that wrote it isn’t on my laptop. My laptop is off. It lives on a small Linux box in a closet, and it’s been awake since I went to bed, reading the news for me on a beat I actually care about. By the time I’m pouring coffee, its reading is done. Some of what it delivers will end up shaping what I write for you: the leads I chase, the links I follow, the threads worth pulling on. That’s the recursion, and it’s part of what this post is about.

The briefing isn’t the only thing the box does. A contractor working on the house sends invoices in whatever format he opens that morning — PDF, scanned photo, sometimes just numbers typed into the body of the email. I connected it to Gmail and a Google Sheet, told it which sender to watch, and asked it to pull line items and append them. The sheet now updates itself. I look at it on Sunday afternoons.

I also keep a running list of restaurants people tell me to try, sorted by city, and the list lives in its memory. When I land somewhere, I ask, and it tells me what I told it last time someone mentioned the place. A hosted assistant could do all of this in theory. None of the ones I use do, because the memory isn’t the product — the chat is.

A month ago, I had you turn your Mac into a model server — llama.cpp. The marginal cost of the next token was zero, assuming you don’t count battery drain or the fact that your thighs were now doing some of the thermal work. Then you closed your laptop, and the model went to sleep with it. Last month we made the model fast. This month we give it somewhere to be when you’re asleep.

Don’t self-host an agent to get cheaper chat. Self-host one to get something that doesn’t sleep when you do. Think of the first useful self-hosted agent not as an AI friend, but as an AI beat reporter. By the end of this post, you’ll have one — even if you skip every command and just read along.

What an agent actually is (and why ChatGPT isn’t quite one)

If you’ve used Cursor or OpenCode, you’ve used an agent — a model in a loop (or, maybe more honestly, a model in a while True: loop with a break condition the model itself decides on). It’s called the ReAct pattern: read the situation, decide what to do, do it (using tools), look at what happened, go again. Coding agents run that loop with a narrow toolbelt: read code, edit code, and run tests. A general-purpose agent runs the same loop with a wider one: search the web, draft an email, and ping you on Telegram when the build breaks.

When you use ChatGPT, you are the agent. You read its answer, decide if it’s right, ask the next question, and paste in the error message yourself. A real agent is the loop. That’s the whole gap between asking ChatGPT to summarize what’s new in local AI every morning and waking up to a briefing already in your phone, generated while you slept.

A note before we go further: An agent that can act can act wrong. An agent with shell access can rm -rf something it shouldn’t. A wrong answer is a waste of tokens. A wrong action is a waste of trust. An agent reading attacker-controlled web pages can be talked into doing things you never asked for — that’s what people call prompt injection, and it’s part of the threat model the field is still figuring out.

We are at the very beginning of figuring out what permissions for agents should look like: what gets done without asking, what requires confirmation, and what’s flat-out forbidden. You can take a look at my small proposal, called Harbor, which is currently more conversation-starter than product, and still needs work. The rest of this post is, in part, what “be careful” looks like at the level of one practitioner with a laptop: sandbox the agent, narrow what it can touch, lock the front door to one person, and verify the boundaries by hand. From my perspective, the interesting question with agents isn’t “Can it do this?” — it’s “What happens if it does the wrong version of this?” The answer should be: not much.

Subscribe now

Some disclosures…and which agent we’re going to use

A few things to put on the table before I tell you to install anything.

First: I’m the CTO of Mozilla, and there are two Mozilla projects already alive in this neighborhood. Thunderbolt, out of MZLA, is the open-source self-hostable AI client for organizations that want a sovereign stack — the on-prem version of this conversation, scaled up. Octonous, out of Mozilla.ai, is the agent product for people who want clear scope and approvals before any workflow. Both deserve their own walkthroughs and will probably get them — just not today.

Second: Mozilla is also an investor in Nous Research, whose agent runtime, Hermes, is what I’m about to recommend. We saw the work first; the check was sent after. Tell me I’m wrong on the merits, not on the affiliation.

Third — and this is the one the comments section is going to want to fight about: OpenClaw. It’s the agent that went viral last winter — Hard Fork did an episode, Anthropic’s lawyers forced a rename, its creator joined OpenAI in February — and it is, fairly, the moment “AI agents” became a category in the public mind.

While it’s not where we’re starting today, there is a security story worth flagging: One of OpenClaw’s own maintainers warned on Discord that the project is “far too dangerous” for users who can’t run a command line, and the foundation is in the middle of figuring out who steers it now that the maintainer is at OpenAI. But that’s not really why. The deeper reason is that Hermes is built to become yours. The memory is a file you can cat. The skills are a directory you can ls. The model can be swapped without rewriting anything. The whole arrangement can be picked up off one machine and put down on another. I’m interested in things built for owners, not renters, and Hermes is the project that lets you own this one.

And Hermes is open source — which matters even more for an agent than it does for a model. A model that hallucinates wastes some tokens. An agent that hallucinates takes an action you didn’t ask for, and you find out after. Software that acts on your behalf should be software you can read. Not “audit” in the compliance sense — cat the file and see what it’s about to do. You don’t have to read it. You have to be able to.

What you’re building

Five layers, in this order: the local model endpoint from last issue (or its faster cousin, oMLX, which I’ll explain in a minute), Hermes on top, Telegram as the interface (locked to your numeric user ID and nobody else’s), one recurring job worth doing, and eventually an Ubuntu box that keeps running it after you close the laptop. The first three live on your Mac. The fourth is what makes the fifth worth doing.

The shape of the rest of this post is five steps. Steps 1 through 5 get you a working agent on your Mac, end to end. In a future issue, we’ll do step 6 — the upgrade — moving it onto a server so it keeps working after you close the lid. If you stop after Step 5, you still have something useful.

Step 1: A model endpoint worth pointing an agent at

If you set up llama.cpp last month, you can skip the install, but maybe read the next two paragraphs anyway.

Coding agents — and agent workloads in general — don’t send one prompt and walk away. They send dozens of requests in quick succession, and every request has to ship the entire conversation so far: the system prompt, the tool definitions, the codebase you handed it, every previous turn, plus whatever’s new. Imagine ordering at a restaurant where the waiter has to re-read the entire menu out loud, top to bottom, before you’re allowed to say “and a side of fries.” That’s what the model is doing on every turn.

The thing that normally saves you from this is the KV cache, the model’s running notes on the earlier part of the conversation, kept in memory between calls so it can skip ahead instead of re-reading. But the cache only works if the earlier part — what people call the prefix, basically everything before the latest message — is exactly the same as last time. Add a new file to the context, edit a tool result, change a single token near the top, and the cache is invalidated. The waiter starts the menu over from page one. A few turns in, you’re watching a spinner for 30 to 90 seconds while it re-reads what it already knew. The model used to be fast. Then the agent loop blew through the cache.

oMLX is a native macOS inference server built on MLX, Apple’s own ML framework, and its headline feature is paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix comes back, it’s restored from disk instead of being recomputed — the waiter remembers your menu from yesterday. The project’s own numbers match what I see on an M4 Max: time-to-first-token drops from 30–90 seconds to 1–3 seconds on long contexts in the second-or-later turn of an agent session. That is the difference between “local agent I gave up on” and “local agent I actually use.”

Setup is undramatic. Download the DMG from the release page, drag to Applications, set the port to 8000 and the API key to localdev. For the model, take the boring default: mlx-community/Qwen3.5-9B-MLX-4bit. We can optimize or choose something more interesting later.

Step 2: Hermes, and the first boundary

Install Hermes:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.zshrc

That gets you the hermes command. The next thing is to tell Hermes which model to use. The way you do that is hermes model, which is an interactive walkthrough — not a flag-festival, and not a config file you have to memorize. It opens a menu of providers; pick “Custom endpoint (self-hosted / VLLM / etc.),” and then it asks for four things in turn:

URL: http://127.0.0.1:8000/v1
API key: localdev (the placeholder you set in oMLX)
Model: Qwen3.5-9B-MLX-4bit
Context length: 131072 — half of what the model can natively handle, which is plenty for an agent that hasn’t earned a longer leash yet. (Hermes refuses to start with anything under 64K by design. Agents need working memory.)

Now, before the first prompt, let’s think about the boundary. By default, Hermes will execute its tool calls, which include shell commands, directly on your computer. But, on a laptop full of credentials, source code, and a browser logged into half the internet, that seems crazy. Change the terminal backend so the agent runs its tool calls inside a Docker container instead of directly on your machine. If you don’t already have Docker on the Mac, install Docker Desktop (or brew install --cask docker if you’d rather get there from the terminal). Then:

hermes config set terminal.backend docker

It is a cheap and imperfect boundary, but it is definitely better than no boundary. It is the right default for a tool you’re learning. You may get annoyed early because the agent can’t see some files you can. But that’s a feature and not a bug! That’s the boundary working, not the agent failing.

Now run hermes and ask it something easy to verify:

Summarize the README in this directory.

If you see it tool-call its way through reading the file and hand back a summary, you’re past the hardest part. From here, everything else is shaping the system around the conversation it can already have.

Step 3: Telegram, locked to you

You need a way to talk to this thing while out and about, and without sitting in front of the terminal. The candidates are many, but, for me, it’s Slack, email, and Telegram. Slack was wrong because I don’t enjoy having to wait for the IT team to approve things for me. That’s the right thing for “production,” and absolutely none of it is correct for an evening of tinkering. Email had its own complications. Telegram, by a wide margin, is the fastest path from no bot to bot replying to my phone. (If you don’t use Telegram day-to-day, that’s fine — you don’t need to. Install the app, talk to one bot, never open it again.)

The mechanics take 15 minutes. Talk to BotFather, get a bot token. Get your own numeric Telegram user ID — @userinfobot will tell you. Then hermes gateway setup walks you through wiring those two values in. The values land in ~/.hermes/.env as:

TELEGRAM_BOT_TOKEN=
TELEGRAM_ALLOWED_USERS=

That last line is a lot of the protection. Without it, anybody could, theoretically, message your bot — and now a stranger is sending a prompt that runs against your agent with shell access to your container on your laptop. With it, the gateway will refuse messages from anyone whose numeric ID isn’t on the list. (If you ever want to add another person, it’s a comma-separated list. Do not add another person on day one.)

I deliberately didn’t install Hermes as a launchd service yet. launchd is the macOS process that runs programs in the background and restarts them when they crash, so they keep running even when you’re not looking at them. When you’re tinkering, you don’t want that. I wanted to see the gateway receive messages, reply, and fail in plain sight before I tucked it away into the operating system, which it promptly did: My first hermes gateway run failed with database is locked on the local SQLite file, because another Hermes process was still holding it from earlier. This is the unglamorous truth of self-hosting: a real chunk of the work is debugging operational plumbing — sockets, ports, file permissions, processes that won’t release a file — none of which is slick, and none of which is about AI. Most of the work in self-hosting isn’t AI. It’s port numbers.

I killed it. Restarted. Sent /start from my phone. Got nothing. (Turns out /start isn’t wired up as a special handler.) Sent hi instead. Got a reply.

So now, my Mac was actually serving a model through an agent to a phone, with exactly one human authorized to talk to it. It was also, if I’m honest, the moment the project felt real. Everything before it was set-up. Everything after this is product work.

Step 4: Give it one job

A bot that replies to hi is a demo. Now it needs a job.

The job I gave Hermes is to create the briefing from the opening of this post: scan the web each morning on a beat I care about — local AI models on consumer hardware, self-hosted agents, the open stack — and write it up as three short paragraphs and five bullets, ending with one possible newsletter angle and one contrarian take. Source links at the bottom. Deliver it to me via Telegram by 6:30 a.m.

Step one, Hermes needs to read the web, which is harder than it sounds, because Google and most search engines actively block automated traffic. Point an agent at google.com and you’ll get CAPTCHAs and rate limits. The workaround is a search/scrape provider that handles the anti-bot work and hands you back clean text. I used Firecrawl, Hermes’ default web backend. Sign up, grab a key, and add it to Hermes’ own environment file (~/.hermes/.env), not your shell’s. The gateway only reads its own .env; dropping FIRECRAWL_API_KEY=... into your terminal session won’t reach the long-running process.

echo ‘FIRECRAWL_API_KEY=fc-your-key-here’ >> ~/.hermes/.env

Then restart the gateway so the new key is in effect. (I forgot that part the first time and spent 10 minutes baffled by why the agent couldn’t search the web.)

A quick note, because this trips people up. There are two surfaces: the Hermes console (the interactive TUI you launch with hermes from a terminal) and Telegram (the chat bot you just wired up). They are not interchangeable. The console is where you teach the agent new things — iterate on prompts, paste in critiques, save skills, watch the tool calls happen in real time. Telegram is the delivery surface — short messages out, short answers back, scheduled briefings showing up in the morning. You could technically teach the agent through Telegram, but you’d hate it. The console is where the work happens; Telegram is where the result shows up.

Here’s the workflow that works: gateway running in one terminal, Hermes console in a second terminal, where I iterate on the briefing prompt by hand until it produces what I want.

Tighter. Cut the editorializing. Lead with the link, not the headline. Stop telling me what’s interesting and just tell me the news.

When the output stops disappointing me, I hand it to Hermes’ built-in scheduler:

hermes cron create “every 1d at 06:30” \
“Run the daily Owners, Not Renters briefing and deliver to Telegram”

That’s it. No system crontab to edit, no separate scheduler to install. Hermes runs the prompt at your scheduled time every day inside a fresh agent session, and the gateway delivers the result to the platform you configured.

It’s pretty neat to wake up to a real briefing — three stories, one of them genuinely new to me, sitting at the top of my Telegram before I’d opened my laptop.

Step 5: The part where it starts to become yours

A few days in, I had opinions. The summaries were too broad. I wanted more on self-hosted agents, less on every new model release. I wanted five bullets, not three paragraphs. I wanted a single contrarian takeaway in one sentence at the end, with no preamble. I told Hermes that in the middle of a console session:

Too broad. Focus more on self-hosted agents and local models. Five bullets. End with one newsletter angle and one contrarian takeaway. Remember this format for future briefings.

Then:

Save this process as a skill called owners-briefing.

The next morning’s briefing was in the new shape. The morning after, too. And the morning after that. I had opened a session, complained about the output, and the complaint had stuck.

That’s also the moment to upgrade the cron job. Instead of stuffing the format into the prompt every time, you point cron at the saved skill:

hermes cron edit  --skill owners-briefing

You can edit the schedule, the prompt, and the format independently. Tomorrow, you might decide the briefing should run twice a day, so you change the schedule. Next week, you might add Slack delivery; that’s a separate flag. And the format you taught lives in the skill, ready to be reused by anything else you want to give it to.

That is the upgrade. It’s the thing that is genuinely hard to get from a hosted assistant whose memory belongs to someone else’s product roadmap.

What you have, and what’s next

What you have at the end of Step 5 is a real thing: a model on your machine, an agent running on top of it, a chat surface locked to one human, and one recurring job that gets better when you teach it. It runs while your laptop is open. It learns when you correct it. It belongs to you, not to a hyperscaler. That is — genuinely — most of the value.

What you don’t have yet is permanence. Close the lid and the briefing stops. Travel and the appointment goes unkept. In a future post, I’ll show you how to move this whole stack onto a small Linux box that stays awake when you don’t, without losing any of the boundaries we just spent a thousand words putting up.

For now: Tell me what you’d give it. What’s the one job that would actually show up in your morning if it just ran — without you remembering to ask, without you opening a tab? Reply, or drop it in the comments. The more concrete jobs I have to work from, the better the next post is going to be.

Two things to take with you

Treat your agent like a new hire. If you hired a personal assistant, you wouldn’t hand them your password manager, your bank login, and the keys to your house on day one. You’d give them your calendar, maybe a corporate card with a low limit, and you’d extend their access as you watched them work. Security folks call this least privilege. It’s the right posture for a software agent, too — and it’s not the default posture you get when you install one. Every step in this post was a chance to give the agent more access than it needed: the Docker boundary, the Telegram allowlist, the model running locally rather than wired to half the internet. The point isn’t that any one of those is doing the security work — it’s that you have to be paranoid at every step. That’s the cost of participating in bleeding-edge stuff: the muscle memory doesn’t exist yet, so you do the thinking out loud, every time, until it does.

Give it one job, not ten. The pitch you’ll hear from most of today’s agent software is that it can do everything: read your email, manage your calendar, post on your behalf, run your day. The temptation on day one is to let it. This is exactly how people end up telling agent horror stories at dinner parties. Pick the smallest workflow that would be useful if it ran perfectly, get it working, watch it for a week, and only then start thinking about a second. The briefing didn’t get good because the agent could do many things; it got good because there was exactly one thing to learn the shape of, and one thing to tune. With seven things going at once, you don’t get to see what your agent actually does — you just get a vague sense of whether the day went okay. With one, you start to know it. One job. One feedback loop. Earn the second job by being unsurprised by the first.

Leave a comment

Let’s Get You Paid

Raffi Krikorian — Wed, 29 Apr 2026 13:27:03 GMT

Two weeks ago, I pointed at the gap map and asked you to find a red zone you knew how to fix.

But pointing at the work doesn’t pay for it. Fixing a red cell takes time, and time is what most people capable of the fix are already selling to someone else. So, who actually does the work?

Open source starts everywhere. It scales on payrolls. Linus Torvalds posted Linux to a newsgroup in 1991; it became the kernel that runs the planet after IBM put a billion dollars and 1,500 engineers into it in 2001. Rust was Graydon Hoare’s side project; Mozilla (where I’m CTO) picked it up in 2009 and put engineers on it. PyTorch came out of Meta. Kubernetes came out of Google, carrying forward a decade of lessons from Borg and Omega.

It’s the same script with AI. Look at who’s already paying engineers to close the gap.

The infrastructure layer: Databricks, Hugging Face, Together, Fireworks, Mistral, Anyscale, Modal — one sells something that gets more valuable when open models are enterprise-ready. These are the sponsors of the work. Not in a donations sense. In a payroll sense.

And they’re already hiring. PwC’s 2025 Global AI Jobs Barometer, based on close to a billion job ads, pegged the AI wage premium at 56% — more than double the 25% it found the year before. Motion Recruitment’s 2026 tech salary report saw AI specialization postings up 49%, while senior software developer pay fell 10%.

So the question isn’t whether AI work pays. It’s whose payroll you pick.

Whose payroll already depends on this work?

I’ve been pulling that list as part of my work at Mozilla, posting open roles at the companies actually doing it. You can find an alpha version of it at Weights and Roles. It’s still very rough and very incomplete. (Populating it is the first job. Seventy-eight pages and counting.) I promise, the code goes open source once it’s worth shipping.

If you’re an engineer thinking about your next move, go look. Tell me which companies you’d add.

If you’re hiring at a company building in this stack — open weights, infra, dev tools, governance — email me at raffi@mozilla.org, and I’ll get you on the list. No fee, no tier, nothing to upsell.

The gaps in the above map close when the companies whose business depends on it get to work. Figuring out who those are — and whether you want to work for one — is this year’s biggest question.

That’s not a downgrade from the dream. It is the dream. While getting a salary.

Weekend heroes are great. But let’s get open-source engineers on payrolls.

Leave a comment

Chinese labs, benchmaxxing, and the Common Crawl for robot data.

Here’s a snapshot of what I’ve been reading, thinking about, and playing with this week.

This talk has me rethinking reality. (Again.) Carissa Véliz, an associate professor of philosophy at Oxford’s Institute for Ethics in AI, used her TED slot this year to argue that predictions about people don’t describe the future. They bend it. “Social predictions tend to act like magnets,” she says. “They bend reality towards themselves.” Predict a loan applicant is high-risk, deny the loan, and the prediction makes itself true. Her warning: Watch out for anyone who tells you the future they’re describing is inevitable. I made this argument all the time on Technically Optimistic: Technology isn’t inevitable — it isn’t whatever the tech industry hands you — and we have a say in bending where it goes. Véliz takes it further: Inevitability-framing is a command in disguise, designed to get you to skip the architectural review, the eval pipeline, the rent-vs.-own decision. The fraud model you ship next quarter doesn’t describe your customers. It picks which ones get treated like fraudsters and writes the data that proves it was right. Treat every output as a prediction, not a fact. The hedge Véliz proposes on Jeff Wilser’s AI-Curious podcast is hilariously analog: keep retired engineers on speed dial for the day nobody remembers how to run things manually.

Open-source robotics had a remarkable few months…

The simulator: Newton 1.0 shipped in March, codeveloped by NVIDIA, Google DeepMind, and Disney Research, and powering Disney’s Star Wars-inspired BDX droids. It’s hundreds of times faster than the previous open standard on manipulation tasks;
The body: Asimov Inc. open-sourced its full bipedal humanoid on April 27, with a $15,000 DIY kit on pre-order. If $15K is today’s cost, imagine 18 months from now; and
The full stack: RoboParty’s roboto_origin, out of China, went from blank sheet to running, jumping prototype in 120 days, with a BOM you can fill from Taobao. Earlier this year, K-Scale Labs collapsed while trying to raise $20M for industrial tooling. RoboParty bypassed the problem with off-the-shelf parts and a 3D printer. The pieces are now sitting in public repos. It’s the same rent-vs.-own choice tipping point for LLMs is landing on the embodied side.

TRELLIS.2 has been sitting in my tabs since December. Microsoft Research, with Tsinghua and USTC, shipped a 4B-parameter image-to-3D model: single picture in; fully textured GLB out. MIT license, weights on Hugging Face, full PBR materials — base color, roughness, metallic, opacity. Not a placeholder mesh. I wanted to tinker. Local install needs 24GB+ of VRAM, verified on A100 and H100, neither of which I have lying around. So I almost dropped it. Then I noticed Microsoft hosts it on Hugging Face Spaces. I tried two images: a render of a podcast player I’d been imagining (designed object, controlled lighting, the kind of thing it should handle), and then, to push the model, the Space Garden, my friends Ariel Ekblaw and Thomas Heatherwick’s orbiting greenhouse, featuring 30 pods around a luminous pomegranate tree — the kind of organic-meets-engineered weirdness I was sure TRELLIS.2 had never seen in training. It worked. The Common Crawl for robot data nobody’s been able to scrape may just end up getting generated instead.

Meta confirmed the model side is moving the other way. The first thing out of its nine-month-old Superintelligence Labs is Muse Spark: closed weights, no parameter count, no architecture details. Meta’s been one of the most prominent US advocates of open weights for two years, and that ladder just lost a rung. Their attempt to buy capability from China got cut off, too: Beijing blocked the $2B Manus acquisition — the Chinese-founded agent startup whose product runs on Anthropic models — ordering the deal unwound and blocking both founders from leaving the country during the probe. As I argued earlier this month, the open-source capability gap is closing fast, and Chinese labs are doing most of the closing. The center of gravity in open weights now sits with Qwen 3, DeepSeek, and Kimi, with Mistral and Gemma being the strongest non-Chinese alternatives. Llama isn’t dead. It’s also no longer in the top tier.

The view from inside one of those Chinese labs is more complicated. Zhang Chi spent 2025–26 inside ByteDance Seed, the team behind Doubao, China’s most-used chatbot. The released-weights gap may be closing, but Zhang says the frontier-research gap is widening. He claims Google can iterate a full pre- and post-train cycle in three months, while ByteDance takes six: Every US cycle, the Chinese labs fall another generation behind. He says they’re benchmaxxing on paper, leaning on distillation from Claude, Gemini, and ChatGPT instead of building real data pipelines. The most entertaining detail, though, is what Zhang and his colleagues use for their own work — Claude Code, Codex, Cursor — meaning the next ByteDance model is partly being built by Claude Code. Even so, Zhang says the harness wasn’t the prize. What matters, he tells Into Asia, is still “the model in the backbone, the foundation model that it calls.” The leak gave you the harness. The capability still lives behind the API.

Go one rung deeper and it gets stranger. This week, the Office of Science and Technology Policy director Michael Kratsios issued the executive memo NSTM-4: Adversarial Distillation of American AI Models, framing foreign labs querying frontier models to train cheaper students as IP theft. (Nathan Lambert is worth reading alongside: distillation gets Chinese labs to benchmark parity, not to capability parity, and regulating it won’t change the underlying race.) Days earlier, Nature published a paper on subliminal learning: distilled students inherit behavioral traits — misalignment included — through hidden statistical signals that survive aggressive filtering. The under-discussed finding: transmission only works when teacher and student share a base model. The paper doesn’t tell us what cross-base distillation cleanly preserves or filters out, but it does suggest Kratsios is at minimum solving an incomplete problem. The chains most exposed to this kind of inheritance live inside the frontier labs themselves, where teacher and student lineage match by design. (And if you’re fine-tuning a smaller model on outputs from one of its larger siblings, that’s you, too.) Weights won’t tell you what your teacher passed down. Neither will the data. Only behavior will.

The data you don’t want in the cloud doesn’t have to leave your machine. OpenAI just shipped Privacy Filter, a 1.5B-parameter MoE model for detecting PII in text. Apache 2.0, runs locally, 96% F1 on the standard benchmark (97% on a corrected version). Within 24 hours, Alvaro Videla reported porting it to Apple Neural Engine using GitHub Copilot — claiming 15× faster than CPU, 19× less energy per sentence, 0.22 watts of draw. If you’ve already got your Mac running as a model server, this is another floor in the same building.

US policymakers are openly discussing stronger state control over frontier AI. The Atlantic this week reports that Hegseth threatened Anthropic with the Defense Production Act, while senators have proposed legislation to “explore” nationalization. As I argued in the Times last week, even programs designed to broaden access — Anthropic’s Project Glasswing extends Mythos defense capabilities to dozens of organizations and funds open-source security groups — still concentrate decision-making in a handful of frontier labs deciding who gets early defensive tooling. None of this changes what you ship Monday. It does change whether your API access is still a market relationship — or already a national security one.

AI Agents, Robot Data, and Maxing out Claude

Raffi Krikorian — Fri, 24 Apr 2026 13:54:30 GMT

Here’s what I’ve been reading — and thinking about — this week.

ClawGUI argues that GUI agents — systems that drive apps through taps and swipes instead of APIs — are stuck less on modeling than on infrastructure. Training environments crash. Nothing reproduces across labs. Agents ship in simulators and never touch a real phone. ClawGUI opens the whole stack: RL training, standardized eval, deployment to Android, HarmonyOS, and iOS. The model isn’t the bottleneck. The scaffolding is. There’s still one piece missing: a permission model for agents acting on your behalf. Harbor sketched one for web agents: scoped, per-origin, revocable. Ported down to the device layer, it might be the primitive GUI agents need to run somewhere real.

Vercel just showed what happens without Harbor. VentureBeat called it “the first major proof case that AI-agent OAuth integrations are a breach class most security programs can’t detect, scope, or contain.” Operational fixes are circulating — admin-managed consent, default-sensitive variables, scope audits — all useful, all patches on a permission model that predates agents. Every broad OAuth grant is a pivot point waiting to be used. Scoped, revocable, per-origin isn’t a nice-to-have. It’s the architectural primitive we needed yesterday.

The scaffolding argument shows up in the economics, too. Anthropic’s pricing looks simple until you normalize to revenue per token. The math, according to @exponentialview, lands somewhere uncomfortable: frontier API economics are nonlinear. Cache hits price at a fraction of misses. Long contexts with heavy reuse subsidize themselves; short one-shots don’t. And none of this was designed with agents in mind. A single coding session can burn tokens at rates that would embarrass a month of chat. The sticker price looks stable. The cost per task doesn’t. If your agents are eating flash paper, the only durable fix is owning the runtime.

Until then, you’re still on their stack. This power-user guide for Claude quietly reframes what getting value from a frontier model means. It isn’t prompts. It’s persistent context: Projects, custom instructions, skills, Claude Code, Cowork. If you stop pasting context and start structuring it, Claude behaves less like a chatbot and more like a system you’ve configured. Prompt engineering is giving way to workflow engineering. Don’t copy someone else’s setup whole. Pick one real project, bring over one piece at a time, then keep what earns its place. That’s the first step. The structure still lives in their product, not yours. The exit ramp is missing.

Subscribe now

If you’ve been thinking about memory as longer context windows, Plastic Labs is going in a different direction. (Disclosure: It’s a Mozilla.vc portfolio company; I’m CTO at Mozilla.) Their Honcho is the exit ramp the platforms haven’t built. It treats memory as a reasoning layer, not storage — patterns learned across users, agents, and sessions, carrying state across tools and providers. Less chat history, more shared context layer.

Zoom out and the same pattern repeats. World models are having their closed-vs.-open moment, almost on schedule. Google’s Genie 3 is still in research preview — no weights, no API. World Labs’ Marble ships commercially at $20-$95 a month. Meanwhile, the open side has been quietly filling in: Ant Group’s open-sourced LingBot-World dropped in January under Apache 2.0, with fast-model updates in April: Tencent followed with HY World 2.0 in mid-April, explicitly benchmarking against Marble. It’s the same playbook as coding models: closed labs meter access, Chinese labs open the weights and cede the field. Which means capability isn’t the gate anymore. Control is — whose worlds, running on whose hardware, with weights you can actually inspect.

But speed costs something. A link that’s been making the rounds reminded me of something I wrote last year: the most important thing we can do for data is preserve it. As more of the internet moves behind APIs, logins, and AI interfaces, the default shifts from public-but-fragile to private-and-inaccessible. And the scope of what needs preserving just got bigger. A decade ago, it was pages and datasets. Now it’s agent interactions — the conversations, the tool calls, the reasoning traces that are starting to stand in for what used to be documented decisions. An agent does the research, drafts the reply, negotiates the terms. If nobody captures the session, the decision exists nowhere at all. The individual version of this is why Morph exists — capture the trace, keep the receipts. It’s what I called a RAID array for civilization: redundancy across drives, geographies, institutions, and funding models, so no single hard drive, agency, or administration can silently lose the only copy. Preserving the open web stopped being passive. Preserving agent memory hasn’t even started yet.

And while the machines learn, someone is teaching them. This clip is getting framed as new. It isn’t. It’s how humanoid robotics gets trained now. Pi and Google run teleop farms. Tesla pays people in motion-capture suits. Sunday ships capture gloves. And in lower-cost labor markets, workers repeat simple tasks for hours so models can learn how to move. There’s no Common Crawl for robot data; it has to be made. The same labor arbitrage that trained AI on text is showing up in physical space. The output isn’t words anymore. It’s bodies.

Bailey Pumfleet’s post reads as provocation first, argument second. The observation is real: AI makes scanning and exploiting code cheaper at scale, and security pressure on open projects is rising. But “audits are expensive” to “open source is dead” is a leap, and I think Bailey is absolutely wrong. (Disclosure: Thunderbird and Thunderbolt are projects at Mozilla.) The same forces making attacks cheaper make audits, forks, and alternatives faster. Last year, Thunderbird committed on the record that Thunderbird is and always will be free and open source, even as the team builds paid services around it. This month, the same team shipped Thunderbolt, a self-hostable open-source enterprise AI client pitched directly against Microsoft Copilot, ChatGPT Enterprise, and Claude Enterprise. Open source isn’t dying. It’s showing up in the places where closed platforms have gotten too comfortable.

To end on something completely different. This is a great episode for the nerds who haven’t listened to The Daily yet: the Satoshi teardown. I’m an avid listener. When I was recording Technically Optimistic, my producer and I had a running refrain every time a take landed too earnest or too probing: Sounds too much like Michael Barbaro. Which is another way of saying The Daily’s format makes narrative journalism look easy, and it isn’t: Whoever’s hosting on a given day is working inside Barbaro’s template, and on most days you can hear the effort. This episode isn’t “Who is Satoshi?” It’s a methodical teardown of a 17-year mystery using stylometry, timelines, and Bayesian reasoning to narrow thousands of candidates down to one plausible person. What makes it land is that even after all that evidence, it’s still not definitive. You walk away less convinced that the mystery is solved and more with a gorgeous story of how much of the modern internet was built by people who could disappear completely.

Save the prompts, save the stack

Raffi Krikorian — Wed, 22 Apr 2026 13:50:16 GMT

Three weeks ago, Gemma 4 26B A4B dropped. Google called it their most intelligent open model to date. The benchmark roundups are out: One put the larger 31B sibling at Master level on Codeforces, a tier most rated competitors never reach.

If you’ve been watching any of this, you know the feeling. Every new model announcement brings a flicker of the same question: Am I behind? Is this one better for what I do? This week is about where that feeling comes from — and how to put yourself on the other side of it.

Here’s what’s happening under the hood. OpenAI, Anthropic, Google — the labs running the hosted models you’d otherwise pay for — evaluate every release candidate against rubrics they own, on data they’ve collected. Why shouldn’t you?

This used to be optional. When models changed twice a year and everyone was running the same three, you could coast on vibes and leaderboards. That window is closed. Open-weight releases are accelerating — per the Epoch AI Capabilities Index, open models now trail state-of-the-art closed-source by about three months on average. The swap decision is now monthly, sometimes weekly.

And that’s the easy version. The world we’re building toward is smaller, more specialized models — one for your codebase, one for your data, one for the shape of your work — not a single hosted monolith doing everything poorly for everyone. Ownership at that scale isn’t just about running the weights. It’s about knowing which ones are worth running. Without a standing eval, you can’t tell. You’re either upgrading on faith, not upgrading at all, or — worse — running a fleet of models you can’t compare. All three are expensive.

Evaluation is how owners tell the difference. Renters don’t need it; the platform decides for them. Owners do.

If you followed along with the last DO version of the newsletter, you’ve got a local coding model running on your own machine — llama-server, Qwen3 Coder 30B, OpenCode wired in. A full AI coding setup on hardware you own. You own the runtime now. But evaluation — the step that tells you whether a new model is actually better for you — is still missing. Closing that gap is the point of this piece.

On a hosted API, you pick from a menu: capability, speed, cost. Locally, the selection problem gets wider: weights, runtime, quantization, prompt format, harness — the stack of choices a hosted API makes for you. And the evaluation step the platform quietly did on top? That just landed on your desk, too. Every release is a live swap decision — just like Gemma dropping against Qwen.

What you need is a way to save the prompts you’re already writing — the real work, the stuff you kept — so a new model can be tested against them. This is one of the reasons I built Morph: It versions your coding agent sessions the way Git versions your code: every prompt, tool call, and file edit is captured as an immutable trace. Those traces are what you evaluate a new model against. Read on to learn more about why a separate tool is needed — and how to use it.

Subscribe now

The scorecards won’t save you

Your first instinct is probably to open the scorecards. There’s signal there. But there isn’t an answer.

Google’s Gemma 4 card puts the 26B at 77.1% on LiveCodeBench v6 — fresh competitive-coding problems designed to minimize training contamination. (For context: Gemma 3’s best score on the same benchmark was around 29%. Alibaba’s Qwen 3.5-35B-A3B card reports a Codeforces figure plus a SWE-bench Verified score, where the model has to land a passing patch on a real GitHub issue.) On paper, each card makes its own model look ahead on at least one row. Every row looks comparable. None of them actually are.

(Quick decode: “A4B” and “A3B” are active parameters — the portion doing work per token — versus the 26B and 35B totals on disk. Smaller active, faster inference.)

Here are three things worth noticing every time you read a model card:

What’s missing. Gemma’s card has no SWE-bench verified row. Qwen’s does: 69.2. You can quote Qwen’s number, but you can’t compare it to Gemma’s — because Gemma didn’t post one. No row, no number. A public head-to-head on that benchmark doesn’t exist yet.
The methodology footnote. Qwen’s Codeforces result is footnoted “evaluated on our own query set” — they wrote the problems and scored themselves on them. Google’s Codeforces figure has no such footnote. Two numbers under the same header, problem sets we can’t confirm match. Not a comparison. A naming coincidence.

Arena scores measure preference, not capability. On Arena AI’s open-source leaderboard (April 19): Gemma 4 26B A4B at 1439 ± 8, Qwen 3.5-35B-A3B at 1396 ± 5. Pairwise human votes on prompts from whoever showed up — a vibes test at scale. Useful for chat quality; not an answer to whether a model will fix your broken tests.

The structural problem, per Kapoor and Narayanan’s newsletter last week, is whatever is precise enough to benchmark is precise enough to optimize for. And even if the comparisons were clean, they still wouldn’t be your apples to their oranges — which is the asymmetry Morph is built to close.

Wait — you’re using an AI to grade an AI?

Yes. The technique is called LLM-as-a-judge, and it’s now standard practice. The canonical paper is Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” Frontier judge models agree with human evaluators at over 80%, roughly the rate humans agree with each other. The paper also names the known failure modes: position bias (favors the first answer it sees), verbosity bias (rewards longer answers), and self-enhancement bias (rates its own kin more kindly). Keep those in mind.

Frontier grades, open model runs. Call it distillation of judgment: the closed model’s evaluation sense compressed into a cheap scoring pass on your own work.

Which leaves one thing to sort out: What does the judge actually grade? Promptfoo needs prompts. The ones you’ve been writing all week. And unless you were already saving them, they’re gone.

Git is at the wrong level

Git is a tool for tracking changes. Every version of every file, hashed by content, snapshotted at every commit. Over time, those snapshots pile up into a complete, searchable history of every state the project has ever been in. Git is, at its core, a time machine for bytes.

That design was right for code written by humans. It may be the wrong design for code written with an agent. The file tree is now the output of a probabilistic process, and the interesting part isn’t the bytes. It’s the prompt that produced them. It’s the tool calls the agent made. It’s the three wrong turns before the agent settled on the one you kept. Git stores files, not runs. You commit, and the shell history scrolls past the prompt that got you there.

This is the problem I built Morph for. I’d commit a fix, look at it a week later, and have no idea what I’d asked the agent to land on it. Morph records at the right level: Every agent session becomes an immutable run with a full trace. It sits next to Git in .morph/ and doesn’t interfere. You keep committing with Git. Morph keeps the receipts.

The eval use case fell out of that. Once the prompts are saved, you can replay them against any model. Which is what Tap does. It’s another tool I built, one that reads Morph’s traces and turns them into a runnable eval.

Turn a week of work into a test suite

tap extract \
  --provider ollama:qwen3-coder:30b \
  --judge-provider openai \
  --judge-model gpt-5.4 \
  --output eval.yaml

--provider names the model under test — the same Qwen3 Coder 30B you’ve been running all week. --judge-provider and --judge-model point at the frontier as grader.

Tap classifies each turn as CREATE_CODE, MODIFY_CODE, or FIX_BUG and generates a rubric per turn. More on what makes a rubric actually work in a minute.

Run it…

export OPENAI_API_KEY=...
promptfoo eval -c eval.yaml

The first run takes a while. Local model doing real inference, frontier grading every output. Grab a coffee.

The first time I tried it, I came back and the pass rate was zero. Not “low.” Not “mostly passing.” Zero out of 17 — on code that had shipped, was in main, and I’d been using all week.

Either Qwen was dramatically worse on a second pass or the eval was wrong. I opened the first failure.

The bug was an early return before a loop that should have run. Qwen deleted the right one. But the function also had a second return — a valid guard clause. And the judge, reading a rubric that said only “fix the bug and restore correct behavior,” saw a return, shrugged, and marked it wrong. Technically correct. Operationally useless.

This is a known failure mode. Shen et al. (2026) show that naive rubrics can actually degrade an LLM judge’s accuracy below a no-rubric baseline — a 13-point drop on JudgeBench. The fix is decomposition: rubrics that enumerate the specific constructs the judge should and shouldn’t see, so the verdict hinges on something concrete rather than a general notion of correctness.

So…. I had Tap do rubric generation with an LLM, too. For each classified turn, it prompts a frontier model to produce a code-aware rubric tied to the actual change: which constructs must be preserved, which must change, what observable behavior constitutes a pass. One model writes the rubric, a different model judges against it.

I reran it. Qwen came back at 100%, which makes sense: It’s the model that wrote the code. Graded against a rubric built for the change it made, it should pass.

The rubric was wrong, not the model. Generic rubrics are how you lie to yourself with a benchmark.

Now, the real test

One line in the YAML:

providers:
  - ollama:qwen3-coder:30b
  - ollama:gemma4:26b

Rerun. Same eval set, same rubrics, two models side by side.

An honest comparison

Qwen 3 Coder 30B and Gemma 4 26B tied on my traces. Not “within a point.” Tied. Same pass rate, same failure modes.

It deserves a caveat. My traces are toy work — refactors, small agents, weekend experiments. The kind of prompts where two competent coding models should converge. There’s not much room between them on a 200-line utility. That’s the method telling you the easy cases really are close. The difference shows up as the work gets harder — more files, longer horizon, real tool-calling chains, code that has to integrate with something I didn’t write. That’s where models that look the same on HumanEval stop looking the same in practice.

It’s an argument for building this pipeline now, so it’s there when you need it. Once Morph is recording and Tap is extracting, rerunning an eval is a one-line YAML edit. A new model drops — add it to the providers list, rerun, read the diff. What was a project the first time becomes a background check the second. The ten minutes it takes you the morning a new Qwen or Gemma drops is the cost of never being surprised by them again.

Generic benchmarks are a tiebreaker, not a decision. The set that matters is the one you’re generating every day. Morph saves it. Tap extracts it. Promptfoo runs it. The frontier grades it. You decide.

One ask. If you're coding with agents, install Morph and let it record for a week. It's open source, it sits next to Git, it doesn't care what agent you run. I've been using it on my own work, but "works on my machine" is the weakest eval there is. The pipeline in this piece only matters if the trace layer is solid across codebases that aren't mine. Break it on yours. File issues. Tell me what's missing. The whole thing gets better the more people it's recording for.

Your evals are already on your laptop. Go look at them — and tell me what you find. Drop what you ran, what you saw, and where the eval surprised you in the comments. I’ll read every one and fold the best into the next issue.

Leave a comment

Closing the Open-Source Gap

Raffi Krikorian — Mon, 13 Apr 2026 14:55:09 GMT

Go ahead, open it. It’s a lot of red.

What you're looking at is the open-source AI stack — every layer of it, from hardware and chips up through model weights, datasets, developer tools, documentation, licensing, and safeguards, scored against its closed-source equivalent. The breakdown comes from Unpacking Open Source Artificial Intelligence: Towards a Framework for Openness in Foundation Models. I pointed agents at each one, had them read real repos, compare them to closed-source equivalents, and score across 10 criteria. (If you're curious, hit me up, and maybe I'll package up the code.) If you're actually trying to piece together an open-source AI deployment — stitching inference servers to tooling to compliance — you're running into two problems at once: what's blocking you today, and what nobody's building for tomorrow. This is a map of both.

The headline: Average maturity is 3.2 out of 5. Enterprise readiness averages 2.3.

Works in demos. Dies in procurement. Maybe that’s why every engineering team that would prefer to own their inference is defaulting to closed APIs: Not because the models are worse, but because nobody’s staking a production SLA on a GitHub repo with no support contract. The preference exists. The plumbing doesn’t.

Only one subcategory hits a 4 on enterprise readiness: ML frameworks — PyTorch, TensorFlow, JAX — the things that have had a decade of production hardening. Everything else? No uptime guarantees. No VPC isolation. No AD integration. Closed platforms don’t win on the model. They win on the contract.

The developer experience is the other gap. Integrating the OpenAI API involves five-ish lines of code. Deploying the open-source equivalent? Docker, reverse proxies, manual model management. Teams with a six-week ship date, of course, are going to pick the thing that works today. And every team that makes that call — reasonably, defensibly — translates to another year of lock-in that gets harder to unwind later. (And more expensive when the pricing changes, as 135,000 OpenClaw users just learned.)

Licensing scored the lowest in the entire spreadsheet. There are no templates that address model output ownership, training data rights, or prompt injection liability. Every AI startup either adopts a closed platform’s terms wholesale or pays $50K+ for custom legal drafting. That’s not a technical gap. It’s an institutional vacuum.

Of course, this spreadsheet is almost certainly wrong in places. If you’re deep in any of these layers and you see something I missed, tell me: @raffihack on X, or comment below.

Leave a comment

The Bright Spots Are Real

Deployment, inference code, base weights, and training code all scored 4 on overall maturity. vLLM, llama.cpp, HuggingFace TGI are production-grade today. Strong tools, no wrapper.

The models aren’t the problem.

Open-weight models now trail the closed frontier by roughly three months. That’s Epoch AI’s Capabilities Index talking — a composite score across 37 benchmarks, updated continuously, built in collaboration with Google DeepMind. On MMLU specifically, the gap collapsed from 17.5 percentage points to under 1 in about a year. DeepSeek-R1 matched o1-class reasoning at a reported training cost of $6M. Qwen3.5 scores 88.4 on GPQA Diamond, ahead of everything except the most expensive closed options. On a single gaming GPU, you can run open-weight models matching frontier performance from nine months ago.

Three months behind on capabilities. And closing.

Even Mythos — the model Anthropic said was too dangerous to ship — isn’t an argument against open-source models being smart enough. AISLE researchers replicated several of its headline vulnerability findings with openly available models. Alex Stamos — Stanford’s former internet security chief, ex-CISO at Facebook and Yahoo — estimates six months before open-weight models close the rest of that gap.

But three months isn’t a law of physics. It’s a snapshot. makes a fair counterpoint: as frontier labs push into domains that aren’t on the public web — proprietary environments, long-horizon agentic work, specialized RL pipelines — the gap could widen again.

Which means two things need to be true at once. We need people working on models — the work DeepSeek, Qwen, and Meta are doing matters in a world where every month of lag is a month where someone signs a closed contract instead. But we also need people investing in the rest of the stack, because even when the models are competitive, teams are still defaulting to closed platforms. Not because the model is worse. Because the model is the only part that’s ready.

Deloitte’s 2025 survey found that nearly 60% named legacy integration and risk/compliance — not model quality — as their top barriers to deploying agentic AI. The 2026 follow-up (3,235 leaders, 24 countries) put insufficient worker skills at the top of the list, with legacy data and infrastructure right behind. Nobody in procurement is blocking on benchmarks.

The capability gap is a “months” problem. The packaging gap is a “years” problem. And right now, almost all the energy is going to the part that’s already closest to parity.

Enterprises are signing multi-year contracts, building integrations, and training teams on specific APIs. Every quarter that open source doesn’t have a credible enterprise story, those defaults harden. Procurement doesn’t re-evaluate because a benchmark improved. It re-evaluates when something is obviously easier, obviously cheaper, and obviously supported.

We’ve seen this movie before. In 2002, Linux had the kernel but not the enterprise story: no certifications, no vendor support, no one willing to put it in the data center with a pager attached. Red Hat, SUSE, and Canonical didn’t build a better kernel. They built the wrapper. They made it adoptable. That’s what turned an ideology into an industry. Ask Sun how not shipping the wrapper in time worked out.

The open-source AI stack is at that same inflection point. What’s missing is the boring stuff: the SLAs, the compliance tooling, the five-line integration. That’s not a research problem. It’s an engineering and institutional problem. And those are the kinds of problems this community has solved before.

We’re not losing on the model. We’re losing on the paperwork.

Look at the gap map. Find a red zone you know how to fix. Maybe that’s compliance tooling. Maybe it’s a five-line SDK. Maybe it’s the ToS template that doesn’t exist yet. Maybe you’re training the next open-weight model that keeps the three-month number from slipping. All of it counts.

If you’re building in any part of this stack — models, infrastructure, tooling, licensing, documentation — I want to hear from you. Reply here, DM me, yell at @raffihack. I’ll feature what you’re working on.Let’s close this gap.

What I’m Reading This Week

Anthropic built a model that found zero-days in every major OS and browser — a 27-year-old OpenBSD bug that survived five million automated scans — and then didn’t ship it. Instead: Project Glasswing. $100M in credits to a consortium (AWS, Apple, Microsoft, Cisco, Linux Foundation) to harden infrastructure before Mythos-class capabilities proliferate. The technical writeup is worth your time. Notable absence from the consortium: the US government, which banned Claude from federal agencies. I've been thinking a lot about what this means for the people who weren't on the partner list.

Full disclosure: Nous Research is a Mozilla VC company, but I would highlight them anyway. Hermes Agent is their answer to the question nobody at the closed platforms wants you to ask: What if the agent itself were open? Think OpenClaw, but with autonomous skill creation, cross-session memory, six terminal backends (including serverless that hibernates when idle), and an RL pipeline for fine-tuning tool-calling models on your own trajectories. It swaps providers with zero code changes. MIT licensed, runs on a $5 VPS. The agent-as-a-service pricing model looks a lot less inevitable when the alternative is curl | bash

Chardet, the Python character-encoding library with hundreds of millions of annual downloads, got a Claude Code-powered ground-up rewrite and a license change from LGPL to MIT. Mark Pilgrim, the original author who deleted his entire online presence in 2011, came back specifically to file the issue: exposure to the original codebase means no clean-room defense, and an AI rewrite doesn’t change that. The implications go way past one library: if a court ever rules that AI output is a derivative work of its training corpus, a lot of commercial codebases are carrying copyleft obligations nobody’s accounted for.

Alibaba AI is truly unhinged. (Via Ben Dickson at TechTalks.)

Andrej Karpathy’s — former Tesla AI lead, OpenAI founding team — latest obsession: using LLMs to compile personal knowledge bases. Dump source documents into a raw/ directory, have a model incrementally build a wiki of interlinked markdown files with summaries, backlinks, and concept articles. No database, no custom tooling — just .md and .png files with an AGENTS.md schema. I’m going to bumble my way through building one and write up what actually works.

Carnegie Mellon is releasing a paper on the problem everyone building with coding agents is about to hit: What happens when you need multiple agents working on the same codebase simultaneously? Their system (CAID) uses git worktrees for isolation, git merge for integration, and dependency graphs for task ordering — the same primitives human dev teams already rely on. Key finding: giving a single agent more iterations doesn’t help, but splitting work across coordinated agents does (+14% on library-from-scratch tasks, +27% on paper reproduction). I want to try implementing this.

My good friend Ryan Sarver wrote this — over a million views and counting. He built an AI chief of staff on OpenClaw out of markdown files, Python scripts, and a $20/month subscription. No SaaS, no vendor, no contract. Flat files he owns, backed up to git. It tracks his fundraise pipeline, preps him before every meeting, extracts action items after, and improves itself every week. The title says “better than any human I’ve hired.” Ryan hired me. I’m choosing not to take that personally. But this is the quote that matters: “Open source is an unmatched engine for turning community ideas and energy into progress. A week from idea to full native integration. Closed source is fast, but open source is something else entirely.” I’m building my own version and will write it up for you soon.

Button is a $180 AI pin from ex-Apple Vision Pro engineers that BLE-tethers to your iPhone and proxies every utterance to an unspecified cloud LLM. It might run something on-device. Nobody knows, including, apparently, the press. Eight bucks a month for unspecified Pro features. No Android. I keep waiting for AI hardware to be more than Siri in a lanyard, and it keeps shipping as a $180 curl to someone else’s API. My friend Ayah Bdeir nailed the structural reason in MIT Tech Review: when the hardware layer is closed, every device converges on the same interaction model — press, talk, wait for cloud. You can’t iterate on form factor if you can’t touch the board. She’s now CEO of Current AI, which just demoed an open-source handheld at the India AI Summit — local inference, no cloud round-trip, 22 languages via Bhashini, full schematics going to GitHub. We talk about open source like it’s a software problem. The hardware layer is just as locked, and it’s why every AI device on the market feels exactly the same.

Anthropic told every tool built on top of Claude: Use our API and pay per token or lose access. Starting April 4, subscribers can no longer route their subscription through third-party agent harnesses like OpenClaw. The stated reason is capacity: A single OpenClaw user consumes 6-8x the resources of a human subscriber, and subscriptions weren’t built for agentic workloads. The unstated context? OpenClaw’s creator had just joined OpenAI, and Anthropic had just shipped competing features into Claude Code. If you needed a concrete example of what platform dependency risk looks like in the AI stack, this is it. 135,000 active instances woke up one Saturday to find their cost structure had changed overnight. Some users are reporting 50x increases. Others are switching to local models.

Google Research open-sourced TimesFM, a foundation model for time-series forecasting — not “what word comes next?” but “what number comes next?,” applied to server load, revenue, error rates, stock prices, energy consumption. 200M parameters, 16k context length, PyTorch or JAX, Apache 2.0. Most teams doing this work are still hand-rolling statistical models or paying for a proprietary API.

Finally, file under DO NOT GET ME STARTED…: The White House proposed slashing NSF’s budget by 55% to $4 billion while claiming it will “maintain funding” for AI and quantum research. Read the fine print: basic AI research at NSF would be cut 32%, basic quantum 37%. The “maintained” funding goes to applied research at Defense and Energy. So the plan is to defund the pipeline that produces the science and then act surprised when there’s nothing left to apply. If you’re wondering why ownership of the AI stack matters, this is the policy environment you’re building in: The public funding for foundational research is being gutted while the administration tells you everything is fine because DARPA got a raise.

Subscribe now

Your Mac Is a Model Server

Raffi Krikorian — Wed, 01 Apr 2026 12:57:16 GMT

You know that moment on a plane where you open your laptop, pull up your coding agent, and remember that it lives on someone else’s server? United’s wifi is doing its thing — which is to say, nothing — and you’re sitting there with an M4 Max that could run a 35-billion-parameter model but can’t complete a function call. This note is about fixing that, and a few hundred other decisions like it.

There are guides for running open-weight models locally. What’s missing is which model, which agent, and why. So, that’s what this is. By the end of this, you’ll have a working Claude Code equivalent running (kinda) entirely on your own hardware.

What You're Building

Three pieces: a local inference server (llama.cpp) that speaks OpenAI-compatible HTTP, a model you’ve chosen consciously with a license you’ve actually read, and an agent that routes tasks to it.

The agent I’m replacing Claude Code with is OpenCode — same terminal TUI feel, same agentic loop, reads your repo, proposes changes, executes tools, commits. If you’d rather stay in your editor, VS Code + Continue gets you something close to Cursor pointed at your local server. This guide focuses on OpenCode. The rest of the setup is identical either way.

Cursor bills me every time I look at it. This runs on hardware already sitting on my lap, and the marginal cost of the next token is zero.

Step 1: Install the Runtime

brew install llama.cpp fzf hf

llama.cpp is the inference server. Metal GPU support is on by default for Apple Silicon. fzf is needed by OpenCode for fuzzy search. hf is the official Hugging Face CLI — the right way to download models.

Confirm it worked:

llama-server --version

Step 2: Pick and Download Your Model

The benchmark tables are mostly useless for this decision. Three questions matter: Does it fit in your RAM, does it call tools reliably, and can you actually build on the license?

Start with how much RAM you have:

system_profiler SPHardwareDataType | grep -E “Chip|Memory”

16GB
→ Model: Qwen2.5-Coder-7B-Instruct
→ License: Apache 2.0
→ Size: ~5GB
→ Fast, reliable, great starting point

32–48GB
→ Model: gpt-oss-20b
→ License: MIT
→ Size: ~12GB
→ Best for tool use + agent workflows

64GB+
→ Model: Qwen3.5-35B-A3B
→ License: Apache 2.0
→ Size: ~22GB
→ More powerful, MoE architecture

Download commands:

# 16GB
hf download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF \
  qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --local-dir ~/models

# 32-48GB
hf download ggml-org/gpt-oss-20b-GGUF \
  gpt-oss-20b-mxfp4.gguf \
  --local-dir ~/models

# 64GB+
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --local-dir ~/models

These downloads are large (5–22GB). Start one; go get coffee.

The 16GB and 64GB+ models are Q4_K_M - 4-bit quantization, the standard quality/size tradeoff. You lose a small amount of quality, gain a large reduction in size and memory. For most coding tasks, you won’t notice. The 32-48GB model is different: mxfp4 is how gpt-oss was actually trained, so there’s nothing to lose.

All three models in this issue are Apache 2.0 or MIT — you can build on them commercially without restrictions. (That’s not true of every popular model.) Before you add anything else to your stack, read the license file in the repo, not the marketing copy on the model card.

Case in point: yesterday, Alibaba released Qwen3.5-Omni — their new multimodal model — as API-only. No weights, no license file, no download link. The previous version was fully open. The researcher most associated with Alibaba’s open-source work left the company earlier this month. None of this changes what’s in your ~/models directory right now — those weights aren’t going anywhere. But it’s the kind of signal that makes me want to be better at evaluating alternatives, not worse. Next issue: how to actually do that with Kimi and DeepSeek.

Step 3: Start the Server

llama-server \
  -m ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --jinja \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  -ngl 99

Two flags worth explaining:

--jinja enables tool-calling. Skip it and your agent will appear to work but silently refuse to call tools. This flag is responsible for most of the “it won’t use tools” failures I’ve seen.
-ngl 99 offloads all model layers to Metal GPU. Skip it on Apple Silicon and you’re running on CPU — ten times slower than it should be.

Wait for this line before doing anything else:

main: server is listening on http://127.0.0.1:8080

Sanity check:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H “Content-Type: application/json” \
  -d ‘{
    “model”: “local”,
    “messages”: [{”role”: “user”, “content”: “say hi in 5 words”}],
    “max_tokens”: 20
  }’

If you get a response, your stack is working. The first time I ran this I immediately opened Activity Monitor to watch the GPU. Everything after this is just routing clients at it.

Step 4: The Agent - OpenCode

My friends at anoma.ly built OpenCode as the open-source Claude Code equivalent. Same terminal TUI feel, same agentic loop. The difference: You’re not locked to Anthropic’s models, pricing, or terms.

(Timing note: As I was literally about to hit send, Anthropic accidentally shipped Claude Code’s entire source — 512,000 lines of TypeScript — in a .map file on npm. Nobody hacked anything. The front door was open. Within hours, a developer named Sigrid Jin used a different coding agent — OpenAI’s Codex — to rewrite the whole harness in Python before sunrise. The repo hit 48,000 stars in a day. A coding agent cloned a coding agent in one session. It’s days old, legally radioactive, and the author ships a parity_audit.py that’s refreshingly honest about what’s missing. I am not recommending you use it. I am noting that the universe has a sense of humor about the rent-vs.-own conversation.)

I did brew install opencode first. It installed, but it didn’t work. Thirty minutes later, I found the tap.

brew install anomalyco/tap/opencode

Pre-install the provider package. OpenCode downloads this at startup, but it hangs silently if npm is slow. Do it manually once and you’ll never hit this:

mkdir -p ~/.cache/opencode
cd ~/.cache/opencode
npm install @ai-sdk/openai-compatible

Now, create two config files, nothing else. In ~/.config/opencode/opencode.json

{
  "$schema": "https://opencode.ai/config.json",
  "model": "llamacpp/qwen3.5-35b",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "qwen3.5-35b": {
          "name": "Qwen3.5-35B-A3B",
          "tools": true,
          "limit": {
            "context": 32768,
            "output": 8192
          }
        }
      }
    }
  }
}

and in ~/.local/share/opencode/auth.json

{
  "llamacpp": {
    "apiKey": "sk-local"
  }
}

Launch (server must be running first in a separate terminal):

mkdir ~/test-project && cd ~/test-project
opencode

OpenCode will ask if you want to configure providers. Anthropic, OpenAI, a dozen others. You can skip every single one. You don’t have a remote platform. You don’t need a key. That moment — closing out of a credentials prompt with nothing in it — is the whole point of this newsletter in one keystroke.

Want proof that it works? Type this in the input box:`

write a python script that prints ‘hello world’

It does. Obviously. That’s not proof of anything except that your wiring is correct. Let’s make it actually work for its dinner.

create a tetris clone in typescript that can be served statically out of a dist/ directory

OpenCode maps out a plan, scaffolds the project, writes the code. I open the browser and…

Uncaught SyntaxError: unexpected token: ‘as’ app.js:2:20

One error. I pulled it from the browser console, pasted it back into OpenCode, and it fixed it. Served it again. Played Tetris — on a model running on my laptop, with no API call, no subscription, no telemetry.

I should be honest about what that means. If I’d done this in Claude Code with Opus 4.6, it probably gets it right on the first try. I know it would in Cursor — I’ve run this exact test enough times to be bored by it. A local 35B model making one fixable mistake on a from-scratch Tetris clone is genuinely good. It is not frontier-model good. That’s the tradeoff you’re signing up for, and you should know it going in.

That’s the whole stack: local model, Metal GPU inference, OpenAI-compatible API, open-source agent, tool-calling. Nothing left my machine.

What You’re Actually Getting Into

The tooling is fast-moving and occasionally broken. OpenCode ships near-daily updates. The standard Homebrew formula lags. The startup hang — silently trying to download a provider package with no progress indicator — is a known issue with a simple manual workaround, which is why the setup above pre-installs it. This will improve. Worth knowing what you’re getting into.

Tool calling is fragile. The model has to be trained for it and templated correctly. --jinja is the most common fix. Second most common: The model you chose wasn’t trained for tool use. Qwen3.5 and gpt-oss-20b are; not all models are.

Context size costs RAM. The KV cache is the model’s working memory for a conversation — it stores computed state for every token in context, across every layer of the model. It grows linearly with context length, and at long contexts can rival the model weights themselves in memory. Start at 32k. Increase only if you need it and have confirmed headroom.

It’s faster than you might expect. Running Qwen3.5-35B on an M4 Max: it’s semi speedy! Comfortable reading speed is around 15-20. You are not waiting on this model. Where you will feel it is autocomplete — round-trip latency for inline suggestions is longer than a hosted API, and for that use case it’s a real regression. For agent tasks where you kick something off and come back, it’s a non-issue.

It works on a plane. No wifi, no API, no problem. I’ve done more useful work in airplane mode with a local model than I ever did frantically tethering before descent. Slower latency is a real tradeoff. Offline is a real feature. Decide which one you’re optimizing for on any given day.

The Decision Framework: What Stays Local, What Goes Cloud

Local wins on:

Proprietary code you shouldn’t be sending to an external API
Repetitive, high-volume tasks where cost is the actual constraint
Anything where predictable latency matters more than raw speed
Anywhere without reliable internet: planes, trains, a cabin, an SCIF

Cloud still wins on:

Tasks that genuinely require frontier capability: complex, multi-step reasoning, novel architecture work
Very long context where you don’t have the RAM to compete locally
One-off tasks where setup time exceeds cost savings

The useful test: could a competent senior engineer do this task well? If yes, a good local model probably can, too. If the task requires something closer to “principal engineer with five years of context on your entire codebase,” you might still want the cloud.

The goal isn’t running everything locally. The goal is making the decision on purpose.

Now tell me about yours. What model are you running? What’s your setup? I’m particularly curious about the Kimi and DeepSeek models — next issue, let’s talk about how to actually evaluate and choose between them.

Leave a comment

What I’m Reading This Week

The new SVG engine Heerich isn’t open source AI, but it is open source. And gorgeous.

Everyone’s arguing about which LLM to run locally. Meanwhile, LeCun’s lab shipped a world model — a system that doesn’t predict the next token, it predicts what happens next in a latent representation of the physical world. That distinction matters: token prediction gives you autocomplete; state prediction gives you planning, reasoning, and systems that can actually act on things. LeWorldModel does it in 15M params, one GPU, two loss terms, no pretrained encoder, 48x faster planning than foundation-model approaches. The JEPA bet is that you don’t need to reconstruct every pixel to understand cause and effect — just the compressed state that matters. Code’s on GitHub.

I think the surveillance world we’re creating is generally creepy. (Part of the wonder of human memory is the ability to forget, to get fuzzy.) But I love this trend of people vibe-coding tools they want for themselves, like this app that makes your intent and decisions searchable by your AI. Of course, this is also the end of every SaaS company on the planet…

After virtual-world interaction, the next thing we all have to think about is robots and physical-world interaction. Maybe I should give these plans to my son to print for his robot hand…

I haven’t had a chance to play with this yet, but I’m definitely intrigued. The bare-metal inference engine runs a ~400B parameter MoE model on a laptop (10x larger than the ones I’m talking about above.) If it works, frontier models don’t have to live in the cloud, trading latency for independence.

Digging into this paper on optimizing the harness w.r.t. end performance.

Here’s the scientifically backed version of a piece I wrote about sycophancy. Stanford researchers found that AI reinforces users’ harmful prompts 47% more than humans. Turns out humans are more than okay with that.

Subscribe now

All About Open-Source AI

Raffi Krikorian — Thu, 26 Mar 2026 15:25:41 GMT

The Owners Not Renters Blueprint

Closed AI is winning because it’s easier to use, not because it’s better. That’s not a permanent condition. It’s a design problem. And design problems can be solved.

Most AI coverage doesn’t treat it as a problem at all. It’s more focused on benchmarks and funding rounds and whatever shipped this week. This newsletter is for the people downstream of those announcements: the engineers and engineering leaders who have to decide, in a sprint, whether to build on something they don’t control.

I believe that you’re not choosing between open and closed AI. You’re choosing between owning and renting. Think about it: The landlord sets the price, the context window, the terms of service, and the deprecation schedule. Most of the time you learn about changes in a routine email…until the one that isn’t routine — the one where the model you built on will be gone in 90 days, or the price just doubled. The people sending that email aren’t malicious. They’re optimizing for their business. Which is exactly why you need to be building differently.

We’ve been here before. By 2003, Internet Explorer had 95% of the browser market. Firefox launched in 2004. IE never recovered, fading slowly at first, then completely. Open standards and open source decentralized control over the core technologies of the web.

Once again, the ground is shifting. Small models run on hardware that organizations already own. Enterprises are migrating off closed platforms in numbers that don’t make press releases. Governments are building sovereign AI supply chains. The question isn’t whether this transition will happen. The question is whether we build the developer experience to make open easier than closed before someone else locks the next layer down — before the defaults harden and a new landlord inherits the keys.

That’s the north star: that openness wins on ease, not principle. Before the window closes.

Open isn’t always the right call. In this newsletter, I’ll tell you when it isn’t. What I won’t do is pretend the default is neutral. Every team that ships on a closed platform without examining the decision is making a choice…by not making it.

My job at Mozilla puts me close to where that’s actually being built: developer tools, data infrastructure, investments across the ecosystem. I’ve spent a career inside platforms. I know what it looks like when a technology transition is inevitable and the only question is who shapes it.

This newsletter is for the people shaping — and shipping — it. Twice a month, I’ll explore orchestration, inference, cost, security, governance — the real stuff, from original analysis to deployment patterns. I’ll evaluate model releases based on what matters in production, not what looks good in a press release. I’ll share migration stories from builders doing the work (and making real tradeoffs). And I’ll report with transparency from the seminars, symposiums, and events that I attend.

That’s the “Think” part of the newsletter. In the “Do” section, I’ll share an open source project that I’ve been playing with.

So, ready to run Claude Code equivalent on your own hardware? Catch the next issue. And in the meantime, let me know what you’ve been working on!

Subscribe now

Before you go… here’s what I’m reading this week

A compromised litellm release hit PyPI this week - credential theft, Kubernetes lateral movement, persistent backdoors. Your LLM proxy library just turned hostile. Time to route through Any-LLM instead of installing your attack surface.

Finally, skill-based adaptation converges: Working independently, three papers (MetaClaw, Memento-Skills, OpenSeeker) show that externalized skill/knowledge structures do better than static fine-tuning.

Which Chinese lab’s open-source drop panicked the market this week? As this reaction says: “The AI race isn’t US vs China anymore... It’s closed vs open. And closed is losing.”

Meanwhile, Nvidia is stepping into the open-source game, putting $26b behind its effort to make OpenAI and the gang very nervous. This week, it introduced NemoClaw and OpenShell runtime, with (much) more to come.

Ready for a personal AI agent that runs on your personal device? Meet Stanford’s OpenJarvis.

Leave a comment