The Rent Is Due

Frontier models got pricier. Open models got better.

May 29, 2026

Zapier asked 542 US executives this month if they could swap AI vendors in under four weeks. 89% said yes; 41% said two to five business days. Then Zapier asked the ones who’d actually tried. Of those, 58% said the migration failed outright or burned far more time and money than budgeted. Most executives think they can switch whenever they like. Most of the ones who’ve tried wish they’d moved sooner. Lock-in is real, it’s hard to undo, and the window to bring open models in is before lock-in arrives, not after.

The pricing just moved the other way too. OpenAI rolled out GPT-5.5 at $5 per million input tokens and $30 per million output — double what GPT-5.4 costs. (A million tokens is roughly an 800-page novel.) Anthropic adjusted Claude Enterprise billing on April 15 so Opus 4.7’s higher inference costs flow through to customers; heavy users report 2-to-3× bills. GitHub paused new Copilot signups, capped existing plans, and removed Opus from the Pro tier. The loss-leader years end with a vendor sending you a bill. Lock-in is what makes you pay it. The cheapest moment to leave was before the bill arrived. The next-cheapest is the next thing you read in this letter.
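
To see what a doubling does to a real budget, here is a minimal sketch of the arithmetic. The GPT-5.5 prices come from the paragraph above. The GPT-5.4 prices are an assumption, inferred by halving both rates, and the monthly workload is hypothetical.

```python
# Prices in USD per million tokens.
# GPT-5.5 figures are from the text; GPT-5.4 figures are an assumption
# (both rates halved, based on "double what GPT-5.4 costs").
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a month's traffic at list price."""
    price = PRICES[model]
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return input_cost + output_cost


# Hypothetical workload: 2 billion input tokens, 400 million output tokens.
workload = {"input_tokens": 2_000_000_000, "output_tokens": 400_000_000}
old_bill = monthly_cost("gpt-5.4", **workload)
new_bill = monthly_cost("gpt-5.5", **workload)
print(f"GPT-5.4: ${old_bill:,.0f}  GPT-5.5: ${new_bill:,.0f}")
```

Swap in your own token counts. Output tokens cost six times what input tokens do at these rates, so the split between the two matters more than the total.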

And just yesterday: Opus 4.8. Same price as 4.7, but the benchmarks moved: 69.2% on SWE-Bench Pro, up from 64.3% for 4.7 and 58.6% for GPT-5.5. A new Fast mode runs 2.5× faster at one-third the cost. The early testimonials are raving: Opus 4.8 “proactively flags issues with the inputs and outputs of an analysis, something other models routinely missed and left to the users to catch.” I’m running it through my Morph and Tap harnesses this weekend, against real session traces from my own coding work. Personal evals incoming. The frontier got better and cheaper on the same day the vendor bills doubled. That is the rent landscape in May 2026: moving in two directions at once.

The cheapest seat in the AI house this month comes with the same lock-in problem, except the landlord is a thief. Chinese students are paying 3 to 4 percent of the list price for GPT and Claude via resellers on the Chinese consumer marketplaces Xianyu and Taobao. It isn’t arbitrage. Per reporting on Oxford researcher Zilan Qian’s investigation, it’s a gray-market supply chain. Upstream operators bulk-register Anthropic and OpenAI accounts using stolen credentials and free-credit farming. The “transfer stations” (中转站, in Chinese developer slang) route traffic through their own gateway servers, often with silent model substitution: you pay for Opus, you get whatever the reseller swaps in. Every prompt and output is logged and resold downstream as training data. Anthropic identified roughly 24,000 fraudulent accounts in February tied to DeepSeek, Moonshot, and MiniMax — three of China’s biggest AI labs. The White House called it industrial-scale distillation: training Chinese models on the outputs of American ones, at the cost of whoever’s typing. Every line of code routed through a cheap proxy is teaching the model that wants your day job. The good news: the legitimate cheap option also shipped this week. Apache 2.0, free weights, no resellers.

Cohere released Command A+ under Apache 2.0, one of the most permissive open-source licenses in wide use. 218 billion parameters total, weights free on Hugging Face. 128K context, 48 languages, vision and tool use unified. Runs on two H100s or a single Blackwell, the Nvidia datacenter chips that power most of the cloud’s AI. The benchmarks show gaps (it ranks below the frontier on the hardest agentic coding), but the license is the news. Take it. Run it. Modify it. Sell what you build on it. You don’t need permission. You don’t need a renewal. This is Cohere’s first frontier-class model that anyone can deploy commercially without asking. Co-founder Nick Frosst framed it as sovereign critical infrastructure: government and regulated industry running the model in their own data centers, fully cut off from the internet if needed.

The Register ran a Gartner note this month arguing “sovereign cloud is only possible if you’re Chinese or American.” But… a Canadian frontier-class model under Apache 2.0, running on customer infrastructure, is the rebuttal. If “evaluate an open model” has been on your roadmap for two years, this is the week you cross it off.

OpenAI’s Privacy Filter is the smaller release with the sharper architectural message. A 1.5B-parameter Apache 2.0 model, 50M parameters active per token, that runs entirely in a browser tab on your own GPU. You can open your browser’s network tab and confirm nothing leaves the device. This is the exact opposite of the pattern I called local-washing — a 4GB Gemini Nano model that Chrome silently writes to disk while the “AI Mode” pill still routes every query to Google’s cloud. Local model present, the user-visible feature still rented. Privacy Filter inverts that: mask names, addresses, emails, account numbers, and secrets on your device first, then send the cleaned-up prompt to whatever frontier model you want. The architecture matches the claim. There is no longer an excuse to send raw PII to a cloud model. Bolt it in front of your existing stack this sprint — 1.5 billion parameters, Apache 2.0, an afternoon of work.
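
The mask-first, send-second pattern can be sketched in a few lines. This is not Privacy Filter and does not use its API: plain regular expressions stand in for the model, and `mask_pii`, `send_to_cloud`, and the patterns are all hypothetical. Regexes can catch emails, key-shaped strings, and long digit runs, but not names or street addresses, which is the part a model is for.

```python
import re

# Illustrative patterns only. Order matters: the secret pattern runs before
# the account pattern so digits inside a key are not masked separately.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SECRET", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),
    ("ACCOUNT", re.compile(r"\b\d{8,16}\b")),
]


def mask_pii(text: str) -> str:
    """Replace each matched span with a bracketed label, on the local machine."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


def send_to_cloud(prompt: str) -> str:
    """Hypothetical stand-in for a cloud model call; it only echoes the prompt."""
    return prompt


raw = "Refund jane.doe@example.com, account 1234567890, key sk-abcdef1234567890XYZ"
cleaned = mask_pii(raw)
print(send_to_cloud(cleaned))
```

The design point is the ordering: the raw string never reaches `send_to_cloud`. Only the masked copy does.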

Two pieces of the local stack worth your weekend. Ahmad Osman’s “Inference Engines for LLMs & Local AI Hardware (2026 Edition)” is what local inference reads like when an infra engineer writes it for other infra engineers. The framing is right: you don’t pick the engine first. You pick the hardware, the workload, and how many people will use it at once. Generating text one token at a time is bottlenecked by memory speed, not raw compute — which is why an M4 Max at 546 GB/s of memory bandwidth competes with much pricier datacenter GPUs on single-user inference, and why the right engine (llama.cpp, vLLM, SGLang, TensorRT-LLM, for the people taking notes) falls out of that one constraint.
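
The memory-speed constraint reduces to one division. A rough sketch, assuming each generated token requires streaming the active weights from memory once: the 546 GB/s figure is from the paragraph above, while the model sizes and the function name are hypothetical.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_per_s: float, weights_read_gb: float) -> float:
    """Upper bound on single-user decode speed when memory bandwidth is the limit.

    Assumes every generated token streams the active weights from memory once,
    so the ceiling is bandwidth divided by gigabytes read per token.
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_per_s / weights_read_gb


M4_MAX_BANDWIDTH = 546.0  # GB/s, from the text

# Hypothetical footprints: gigabytes of weights actually read per token.
for label, size_gb in [("20 GB dense model", 20.0), ("5 GB of active weights", 5.0)]:
    ceiling = decode_ceiling_tokens_per_s(M4_MAX_BANDWIDTH, size_gb)
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

This is a ceiling, not a forecast. Real engines land below it once cache reads and scheduling overhead are counted, but it shows why fewer bytes read per token buys speed on the same hardware.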

On the consumer end, Atomic Chat ships Qwen3.6-35B locally on M-series Macs at 60+ tokens per second — output streaming faster than you can read it, on battery, from 22 GB of model weights sitting in your laptop’s memory. Google’s TurboQuant squeezes the model’s working memory down to 3.5 bits per number with no measurable quality loss, even at 65,000 tokens of context. A Claude API call streams back at about the same 60 tokens per second. Same speed. No meter, no outbound packets. I built a working Canvas physics game in a weekend — parallax scrolling, collision detection, no API key. Six months ago that was a cloud workload. Now it runs on your laptop, at cloud speeds, for $0. The cloud isn’t where AI lives anymore. It’s just where some of it happens to live.
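
For a sense of what 3.5 bits per number buys, here is a back-of-envelope sketch. It assumes the “working memory” in question is the key-value cache, and the layer count, head count, and head size are hypothetical, not the real dimensions of any model named above.

```python
def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bits_per_value: float
) -> float:
    """Size of a key-value cache: one key and one value vector per layer per token."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys plus values
    return values * bits_per_value / 8


# Hypothetical model shape; 65,000 tokens of context is from the text.
shape = {"layers": 48, "kv_heads": 8, "head_dim": 128, "tokens": 65_000}

full = kv_cache_bytes(**shape, bits_per_value=16)
squeezed = kv_cache_bytes(**shape, bits_per_value=3.5)
print(f"16-bit cache:  {full / 1e9:.2f} GB")
print(f"3.5-bit cache: {squeezed / 1e9:.2f} GB")
print(f"Reduction:     {full / squeezed:.2f}x")
```

Whatever the real dimensions are, the ratio holds: going from 16 bits to 3.5 bits shrinks the cache by a factor of about 4.6, which is the difference between a long context fitting in laptop memory or not.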

Forty authors. Forty institutions. One vocabulary. A survey titled “Code as Agent Harness” landed this month. The thesis has been forming for months: code stopped being just what AI agents write and became how they work — the substrate the agent operates inside, not the output it produces. The four properties the authors name — executable, inspectable, stateful, governed — are the language auditors and security officers will use in 2027 procurement. Section 5.2’s seven open problems are functionally the procurement checklist. If you’re building an agent and you can’t say yes to all four, you don’t have a product. You have a demo.

A companion paper landed the week before. Chopra et al.’s “Beyond Cooperative Simulators” argues that the AI-generated stand-in users most labs test their agents against inherit their base model’s behavior (patient, cooperative, willing to clarify), and so agent evals systematically overstate real-world performance. Real users are unclear, impatient, and reluctant to repeat themselves. The Opus 4.8 release that just happened, the one that “flags issues with the inputs and outputs,” is the first model the labs have shipped that was trained against exactly this gap. The papers describe the problem. The models are starting to fix it. The harness is where the fix lives. Run your stack against the four properties this week. Anything that fails is what your 2027 procurement gets stuck on. Fix it now while it’s still cheap to fix.

The wrong reflex this month came from a place I love. NHS England ordered tech leadership to set every public repository to private by May 11. The stated reason is that Anthropic’s Mythos — a new model trained to find software vulnerabilities — is good at finding them in code it can read, so the code should not be readable. The unstated reason is that someone in senior leadership panicked.

The repos contain datasets, internal tools, front-end resources, and most of the work the team that shipped the NHS COVID-19 app produced — code that taxpayers paid for twice over, and that other public-health bodies were reading, forking, and reusing. Roughly 200 repositories went private before the backlash arrived. The code is also already on the open internet — on archived mirrors, on forks, and in every major dataset that scraped GitHub between 2021 and last week. Closing the repos doesn’t unship the bytes. It makes the maintained version harder to find. It tells the engineers who built the app on principle that their employer has stopped believing the principle. I argued in the Times earlier this year that the world’s most valuable software infrastructure is maintained by people working for free, while the companies building fortunes on top of it never paid for the upkeep. The NHS was one of the rare institutions actually funding that maintenance, on principle. Now it isn’t. And it gives every other public-sector CIO permission to make the same call. UK GDS published guidance on May 14 contradicting the NHS position directly: “You should never close an open repository.” If your security depends on attackers not reading the code, you never had security. You had time. And the clock just sped up.

If you’re the engineer reading this at a public-sector or regulated employer: the fix is the opposite of the reflex. Open more. Document more. Fund the maintenance. The GDS guidance is the playbook. The Cohere release proves the model exists. The only thing missing is the executive who picks up the phone. Be the engineer who hands them the number.

Bill Gurley’s updated essay at P3 Institute is the strategic frame that makes this look even worse. Open source as a corporate weapon against monopoly, traced from GNU and Linux through Android, Kubernetes, and Llama. The line that lodged: Chinese open models may become the global default by 2030. NHS England picked this month to opt out of the only stack that has a non-Chinese, non-American sovereign answer. Cohere just shipped the model that proves the answer exists.

Three essays I’ve been turning over, all landing on the same point: in an AI world, what something is matters as much as what it does.

Alex Imas’s “What Will Be Scarce?” sets the frame. The more we automate, the more we’ll pay for what only people can do. Two reasons. We shift our spending as we get richer — toward things machines can’t make. And we pay extra when we know a person made it. His proof, from experiments with Graelin Mandel: buyers paid a 44% premium for art they knew was made by a human, and only 21% when they knew AI was involved. The provenance signal does work the functional output cannot reach. Twenty-three points of margin live on a checkbox that says “made by a person.”

Yejin Choi and the WEF report on smaller models make the same argument one layer down, about the models themselves. Boutique LMs trained through interaction, reflecting the values of who built them, vs. the English-dominant monoculture of the frontier labs. Imas’s premium for “made by a person” is one form of provenance. Choi’s case for “made by us, for us” is another. Different scales, same point: the model that knows where it came from is worth more than the model that doesn’t.

Alondra Nelson’s recent piece in Science is the governance answer to the same question. Her argument: AI infrastructure governed by the publics it affects is what lasts. Legitimacy is durability. I served with Alondra on the Mozilla board before taking my current job there, and her work has been the cleanest articulation of how democratic legitimacy attaches to AI infrastructure. Imas says provenance is worth money. Choi says it’s worth building a model around. Nelson says it’s worth governing for.

If you’re building AI-augmented anything, those are the three signals that matter now: who made it, who it speaks for, who governs it. The capability is the floor. The provenance is the product.

The week’s best thing on the internet was not technical. “Claude’s First Day at Dunder Mifflin” on r/ClaudeAI. I won’t spoil it. Whatever you think about AI and entertainment, that one earned the laugh.
