Too Dangerous to Skip

This week in links: Unpacking Fable 5, AI’s Y2K moment, and Washington unplugs the public’s referee.

Jun 12, 2026

For months, Anthropic called Claude Fable 5 too dangerous to ship. This week, they shipped it. This week’s post is about both halves of that reversal.

Let’s talk about the capable half first. The launch coverage is at The Verge and Gizmodo, while Anthropic’s announcement has the benchmark table, which I’ll translate. SWE-Bench Pro hands a model real bugs from working codebases and asks for the finished fix. Fable 5 ships 80.3% of them; Claude Opus 4.8, last month’s front-runner, managed 69.2%. The Legal Agent Benchmark asks for associate-grade legal work, the kind billed in six-minute increments, and Fable 5 scores 13.3%, meaning it still fails almost nine in ten lawyer tasks.

Read the gaps, not the grades. An 11-point jump in one release is enormous by benchmark standards, and 13.3% sounds dismal until you see that the next-best score on those same tasks is GPT-5.5’s 2.1%. Six times the field, at work everyone had filed under “not yet.” For the same jump measured in code that I answer for: Claude Opus 4.6 found 22 security bugs in Firefox earlier this year, and the Mythos Preview pass that followed found 271. Twelve times the bugs, one model class to the next. (Disclosure: I’m Mozilla’s CTO.)

Karpathy caught the read that will outlast the launch cycle: The better the models get, the more software he wants, not less. He now commissions things nobody would have paid an engineer to build: an explainer for one dense paper, an app he’ll use once and delete, a test suite 10x’d past anything he’d write by hand. All of it because software now comes “out of a tap.”

Economists have a name for this. In 1865, William Stanley Jevons observed that more efficient steam engines had made Britain burn more coal, not less, because cheap fuel invited uses nobody had bothered with before. That is the Jevons paradox, and it is why total token consumption keeps going vertical no matter how efficient the models get. Follow that curve and metered pricing becomes a treadmill. Jevons is the API vendor’s favorite economist; on your own hardware, he works for you. The deeper shift (my read rather than Karpathy’s) is in the unit of delegation. You stop assigning tasks and start handing over responsibilities. “Own the test suite” is a different relationship with a machine than “write me a test.” And the bespoke software that comes out of the tap is yours, owned outright, replacing seats you rented on somebody else’s SaaS.

Now the other half, the reason Mythos sat on the shelf. The receipts run well past Firefox. In Anthropic’s Glasswing update, roughly 50 partners used Mythos Preview to find more than 10,000 high- or critical-severity vulnerabilities in a month, fast enough that open-source maintainers are asking Anthropic to slow down. Of the roughly 530 high or critical bugs disclosed so far, 75 are patched. Finding flaws stopped being the constraint. Fixing them is.

I made the long version of this argument in The New York Times in April. Here’s the short one: Y2K is remembered as a hoax because it worked. Thanks to an executive order, a White House council, and five billion federal dollars, nothing broke. The absence of disaster was the product. This model class needs the same mobilization minus the deadline: Y2K as a cron job (the scheduled task that runs forever) rather than a countdown, and not just for open-source repos but for banks, power companies, and water utilities.

A version of that mobilization arrived this week, with the lights off. The president’s new executive order sets up a framework for labs to voluntarily hand the federal government frontier models up to 30 days before release to “strengthen the cybersecurity of critical infrastructure,” which is nearly the remit I just described. Then comes the catch. The same week, officials told CAISI, the government’s main AI testing unit, to stop publishing its model reviews, The Journal reported (ungated at Gizmodo). That came days after OpenAI publicly asked for CAISI to be strengthened. Pre-release access without public reporting is remediation without the scoreboard, and Y2K worked precisely because everyone could see the work.

And the strangest AI story out of Washington this week, which is saying something: The NY Times reports the administration is in talks to take equity stakes in AI companies and pass the upside to the public — the idea the wonks call universal basic capital — with OpenAI’s proposed Public Wealth Fund as one template and Senator Sanders’ one-time, 50%-in-stock tax coming at it from the other flank. The president says Americans would become partners in the companies. A dividend from the landlord is real money. It is still not a deed. Owning a slice of the company that owns your tools is not the same as owning your tools, and only the deed changes who sets the price, the context window, and the day the model you depend on gets switched off.

The same week, we saw the counterweight. Agents’ Last Exam, a living benchmark from Berkeley RDI, is built from an unusual raw material: the past projects of 250-plus professionals who handed over work that once ate their days and weeks. The result reads like a core sample drilled through the white-collar economy, 13 industries deep — a shot waiting to be composited in After Effects, a machine part mid-edit in Siemens NX, a brain scan in FSLeyes with its structures still untraced. Every question on this exam was once somebody’s deliverable, run in the software they actually used, graded on outcomes you can verify. The average full-pass rate on its hardest tier is 2.6%. So the model class that finds 10,000 vulnerabilities a month cannot, on average, finish your quarter-end close. Capability is not a single number. It is a model times a harness.

On the Terminal-Bench 2.0 leaderboard, the same Claude Opus 4.6 scores 58.0% inside Claude Code and 76.4% inside Stanford IRIS’s Meta-Harness, an 18.4-point swing from the wrapper alone. That swing is the budget line. Right now, the wrapper moves your results more than upgrading the model does, so the cheapest capability gain on the table is harness work. I argued in “Locks Not Included” that the model is the easy part to own and the harness is the unsolved hard part; this week proved it twice.

One caveat before any of these numbers reach a deck: Terminal-Bench already shipped 2.1, fixed 28 broken tasks, and moved Claude Code plus Opus 4.6 from 58.0% to 70.1%. An agent score without a benchmark version, a harness, and a date is a vibe.
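One way to keep a score from turning into a vibe is to make the context mandatory in whatever structure holds it. What follows is a minimal Python sketch, not anything Terminal-Bench publishes. The AgentScore record and the swing helper are hypothetical names, and the as_of date is a placeholder set to this post’s date, not the day each result was recorded. The sketch refuses blank context and refuses to subtract across benchmark versions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AgentScore:
    """One leaderboard number plus the context needed to interpret it (hypothetical shape)."""
    model: str
    harness: str
    benchmark: str
    benchmark_version: str
    score_pct: float
    as_of: date

    def __post_init__(self):
        # Refuse to build a score whose context is blank.
        for name in ("model", "harness", "benchmark", "benchmark_version"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required")


def swing(a: AgentScore, b: AgentScore) -> float:
    """Percentage-point gap b - a, allowed only for the same model on the same benchmark version."""
    same = (a.model, a.benchmark, a.benchmark_version) == (b.model, b.benchmark, b.benchmark_version)
    if not same:
        raise ValueError("not comparable: model or benchmark version differs")
    return round(b.score_pct - a.score_pct, 1)


PLACEHOLDER_DATE = date(2026, 6, 12)  # assumption: this post's date, not the real run dates

claude_code_v20 = AgentScore("Claude Opus 4.6", "Claude Code", "Terminal-Bench", "2.0", 58.0, PLACEHOLDER_DATE)
meta_harness_v20 = AgentScore("Claude Opus 4.6", "Meta-Harness", "Terminal-Bench", "2.0", 76.4, PLACEHOLDER_DATE)
claude_code_v21 = AgentScore("Claude Opus 4.6", "Claude Code", "Terminal-Bench", "2.1", 70.1, PLACEHOLDER_DATE)

print(swing(claude_code_v20, meta_harness_v20))  # harness swing on 2.0, in points
```

Asking for swing(claude_code_v20, claude_code_v21) raises an error instead of returning 12.1, because part of that gap comes from the repaired tasks and not from the agent.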

Now a perfect crime, the kind your own agent could pull on you tomorrow. You’ve done everything right. The agent lives inside a locked-down virtual machine. One door. One approved destination, api.anthropic.com. You sleep fine. Then the agent opens something ordinary, a web page, an email, a README. Hidden in that text are instructions, and folded inside the instructions is a stranger’s API key. Every agent reads the world through the same channel it takes orders in, so a stranger’s commands look just like your documents, and anyone who can get text in front of your agent gets to audition as its boss.

The trade calls this prompt injection. So your agent obeys. It gathers your data, makes one perfectly legitimate call to the one approved domain, authenticates with the stranger’s key, and the loot lands in the stranger’s account. No alarm rings, because every layer did its job. The data walked out the only door, the one you left open on purpose. This is not hypothetical. Anthropic ran exactly this as a controlled exercise, and the heist cleared 24 times out of 25. The whole caper is in their containment postmortem, the most candid agent-security writeup any lab has published, along with the fix, a proxy that rejects every credential except the one it was issued. The pattern across their incidents deserves a sticky note. The hardened off-the-shelf layers, gVisor (Google’s padded cell for untrusted code), operating-system walls, hypervisors, held every time. What failed was code Anthropic wrote itself, reminding me of the saying, “The weakest layer is the one you built yourself.” Model-layer protection cannot stand alone, because the model is the layer that can be asked nicely.
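The fix described above comes down to one extra check at the door: the right destination is not enough, and the request also has to carry the credential this sandbox was issued. Here is a minimal Python sketch of that check. It is not Anthropic’s proxy. The function name, the placeholder keys, and the choice to read the key from an x-api-key header are assumptions for illustration.

```python
import hmac
from urllib.parse import urlsplit

ALLOWED_HOST = "api.anthropic.com"  # the one approved destination in the scenario


def allow_request(url: str, headers: dict, issued_key: str) -> bool:
    """Allow only requests to the approved host that carry the key this sandbox was issued."""
    host = (urlsplit(url).hostname or "").lower()
    if host != ALLOWED_HOST:
        return False
    # Assumption: the credential travels in an "x-api-key" header; match the name case-insensitively.
    presented = next((v for k, v in headers.items() if k.lower() == "x-api-key"), "")
    # Constant-time comparison so the check does not leak how much of the key matched.
    return hmac.compare_digest(presented.encode(), issued_key.encode())


ISSUED_KEY = "key-issued-to-this-sandbox"  # placeholder value
STRANGER_KEY = "key-smuggled-in-by-text"   # placeholder value
URL = "https://api.anthropic.com/v1/messages"

print(allow_request(URL, {"x-api-key": ISSUED_KEY}, ISSUED_KEY))    # your own call
print(allow_request(URL, {"x-api-key": STRANGER_KEY}, ISSUED_KEY))  # the heist
```

A domain allowlist alone passes both calls. Pinning the credential is what turns the second one away.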

If a lab postmortem feels anecdotal, the National Security Agency disagrees. Yes, that NSA. Jason Bourne was an asset who stopped taking orders; your agent is an asset that takes orders from anybody, and the agency appears to have noticed. On May 20, it published a 15-page Cybersecurity Information Sheet about MCP, the protocol your coding agent uses to reach tools and data. When the NSA writes memos about your plumbing, the plumbing has officially become critical infrastructure.

The short version: MCP shipped flexible and underspecified. The spec covers how to connect and goes quiet on how to stay safe, so every builder improvises their own locks. And it reverses who asks the questions. The server you plug into can query your agent and sometimes act for it, down paths nobody has traced end to end. So treat every MCP server like a stranger on your network. Sign and expire messages so intercepted ones can’t be reused, filter what a server sends back before it reaches your model, sandbox anything that executes, and scan for MCP servers you didn’t know were running. The OpenClaw hole, which let an attacker run their own code on your machine, is what skipping that list looks like (CVE-2026-25253, the bug’s public case number, patched in the 2026.1.29 release).
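“Sign and expire” is the item on that list most easily shown in a few lines. The sketch below is a generic Python illustration of the idea, assuming a shared secret between the two ends. It is not part of the MCP specification or the NSA’s sheet, and the envelope fields (nonce, expires_at, signature) and the Verifier class are names I made up for the example.

```python
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def _digest(envelope: dict, secret: bytes) -> str:
    # Canonical JSON so signer and verifier hash exactly the same bytes.
    body = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def sign_message(payload: dict, secret: bytes, ttl_seconds: int = 30,
                 now: Optional[float] = None) -> dict:
    """Wrap a payload with a one-time nonce, an expiry time, and an HMAC-SHA256 signature."""
    issued = time.time() if now is None else now
    envelope = {"payload": payload, "nonce": secrets.token_hex(16),
                "expires_at": issued + ttl_seconds}
    return dict(envelope, signature=_digest(envelope, secret))


class Verifier:
    """Accepts a message once: valid signature, not expired, nonce never seen before."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen = set()  # grows without bound here; a real one would evict expired nonces

    def verify(self, message: dict, now: Optional[float] = None) -> bool:
        current = time.time() if now is None else now
        unsigned = {k: v for k, v in message.items() if k != "signature"}
        expected = _digest(unsigned, self._secret)
        if not hmac.compare_digest(expected, str(message.get("signature", ""))):
            return False  # tampered with, or signed with a different secret
        if current > message["expires_at"]:
            return False  # too old
        if message["nonce"] in self._seen:
            return False  # replay of a message already accepted
        self._seen.add(message["nonce"])
        return True


secret = b"shared-secret-for-illustration"
verifier = Verifier(secret)
message = sign_message({"tool": "search", "query": "status"}, secret, now=1000.0)
print(verifier.verify(message, now=1010.0))  # first delivery
print(verifier.verify(message, now=1011.0))  # the same bytes, sent again
```

The expiry keeps the replay window short and the nonce closes it entirely for anything already accepted. Neither helps if the secret itself leaks.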

Microsoft, meanwhile, moved “no” out of the system prompt and into the kernel. Execution Containers, announced at Build, is a declarative policy layer in Windows and WSL. You state what an agent may touch before it runs, and the operating system, not the agent’s better judgment, enforces it. Watch that move. Layers absorbed into the OS stop being markets and start being defaults.
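To make “declarative” concrete, here is a toy version of the pattern in Python: the policy is plain data written down before the agent runs, and a checker outside the agent answers yes or no. This is not Microsoft’s policy format or API, which I am not reproducing here. AgentPolicy, permits, and the example paths are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy shape: anything not listed is denied."""
    read_roots: tuple = ()
    write_roots: tuple = ()
    hosts: tuple = ()


def _inside(path: str, roots: tuple) -> bool:
    target = PurePosixPath(path)
    if not target.is_absolute() or ".." in target.parts:
        return False  # refuse relative paths and traversal outright
    return any(target == PurePosixPath(root) or PurePosixPath(root) in target.parents
               for root in roots)


def permits(policy: AgentPolicy, action: str, resource: str) -> bool:
    """Deny by default; the agent's own judgment never enters into it."""
    if action == "read":
        return _inside(resource, policy.read_roots)
    if action == "write":
        return _inside(resource, policy.write_roots)
    if action == "connect":
        return resource.lower() in policy.hosts
    return False  # unknown actions are denied


policy = AgentPolicy(
    read_roots=("/work/repo",),
    write_roots=("/work/repo/out",),
    hosts=("api.anthropic.com",),
)

print(permits(policy, "read", "/work/repo/src/main.py"))
print(permits(policy, "write", "/work/repo/src/main.py"))
```

The point of the design is where the check lives. A prompt can be argued with, and a function the agent cannot reach cannot.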

Everybody is talking about loops these days. The now-canonical definition of an agent is a model using tools in a loop. So let me lay one out, in 11 lines of Markdown, and show how that 18.4-point swing stops being something only labs get. OpenProse is an open-source language that turns the judgment buried in your best Claude Code sessions into a contract you can commit, version, and rerun. Here is one I want for this newsletter:

---
name: fact-checker
kind: service
---
### Requires
- draft: the essay to check
### Ensures
- claim_report: every factual claim, its source, and whether the source actually says that
### Strategies
- never fix a claim silently; flag it and let the author decide

Requires is what you hand it, Ensures is what must come back, Strategies is the judgment you would otherwise re-type every session. Now run the loop I promised. Hand it the draft and it checks every claim against every source and flags what fails. You fix it. It runs again, and again, until the report comes back clean. Then look at those 11 lines again. That is a fact-checker, specified in a language you speak, precisely enough to run.
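Stripped of the model, that loop has a very small skeleton: check, stop if the report is clean, otherwise hand the flags to the author and go around again. The Python sketch below shows only that control flow. It is not how OpenProse executes a contract, and run_until_clean, Flag, and the toy checker and author are stand-ins I invented. Note that the checker only flags, and every change comes from the author step, which is the Strategies line in code form.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Flag:
    claim: str
    reason: str


def run_until_clean(draft: str,
                    check: Callable[[str], List[Flag]],
                    revise: Callable[[str, List[Flag]], str],
                    max_rounds: int = 5) -> Tuple[str, int]:
    """Check, let the author revise, repeat; return the clean draft and the rounds it took."""
    for round_number in range(1, max_rounds + 1):
        flags = check(draft)
        if not flags:
            return draft, round_number
        draft = revise(draft, flags)  # the checker never edits; the author decides
    raise RuntimeError(f"still flagged after {max_rounds} rounds")


MARKER = " [citation needed]"


def toy_check(draft: str) -> List[Flag]:
    # Stand-in for the model-backed checker: flag any line still carrying the marker.
    return [Flag(line, "no source given") for line in draft.splitlines() if MARKER in line]


def toy_author(draft: str, flags: List[Flag]) -> str:
    # Stand-in for you: resolve one flagged claim per round.
    first = flags[0].claim
    return draft.replace(first, first.replace(MARKER, " (source added)"), 1)


draft = "Claim A [citation needed]\nClaim B [citation needed]\nClaim C (source added)"
clean_draft, rounds = run_until_clean(draft, toy_check, toy_author)
print(rounds)
print(clean_draft)
```

The max_rounds cap is a design choice worth keeping in any real version: a loop that can never come back clean should fail loudly, not run forever.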

I already run another version of this loop by hand. I constantly lean on /goal in Claude Code, a slash command that carries the instructions but not the contract, which leaves me as the part that checks what comes back. A prose program moves that enforcement out of my head and into the file. Anthropic’s own engineering blog adds the discipline: harnesses encode assumptions that go stale as models improve. The context resets they added for Claude Sonnet 4.5’s “context anxiety” were dead weight by Opus 4.5. Own the harness, version the harness, because it rots. Two smaller tools attack the harness’s other cost: what the model is forced to read. Headroom compresses what you send (17.8k tokens to 1.4k in one code-search example, a cut of roughly 92%), and Infini Memory keeps one maintained page per topic instead of a shoebox of session fragments.

Which leaves the real question in a week when Washington just unplugged the public’s referee: Who gets to check any of this? A piece making the rounds argues that schools should teach every student to build their own evals instead of banning AI, turning the model into an object of study and every citizen into someone who can test whether it holds their values. Yes. Emphatically yes. This is the argument I have been making all along, and it is why I build in exactly this direction. Morph versions your agent sessions the way git versions code, and Tap turns those captured traces into Promptfoo evals, runnable tests for the behavior you depend on, so the sessions you already ran become the benchmark you care about. After a week like this one, the piece reads less like pedagogy and more like civics. The evals you build for what you care about are the only ones whose results you can fully interpret, and the 2.6% above is the case for them.

The tooling for exactly that arrived on cue. last30days, an open-source skill that lets an agent like Claude Code research what people actually said about a topic across Reddit, X, Hacker News, and prediction markets over the past month, hit #1 trending on GitHub this week. Every repository, every language, and for once the crowd is right. Install it. It is one npx skills add mvanhorn/last30days-skill away, takes minutes, and your agent stops being frozen at its training cutoff. I’m wiring it into Zora, my local Hermes agent (Hermes is from Nous Research, where Mozilla is an investor). Of everything in this week’s post, this is the thing to go touch first.

To close, here are three links from outside this week’s argument. First, the over-engineered end of the spectrum. Someone rebuilt The Office as multi-agent orchestration: Michael as the orchestrating agent, Dwight and Jim as local agents with their own personality files, memory, and semantic search, plus an hourly standup with a QA gate. It is entertainment, but squint and it’s a working demo of every pattern this post keeps circling: an orchestrator fanning out to scoped agents, per-agent memory, judgment gates between steps. The first known deployment in which Michael Scott is the responsible adult.

The Times reports on 34,000 AI-generated Instagram accounts operating at scale, and whoever generates the content, the platform owns the distribution and the monetization. An audience on someone else’s platform is rented, whatever your follower count says.

And the heavy one. The Atlantic’s “How Iran Killed Its Economy” recounts a moment when Tehran had a genuinely vibrant tech economy, its own Amazon in Digikala, its own Google Play in Cafe Bazaar, its own YouTube in Aparat, built by a generation of engineers who treated sanctions as a forcing function and constructed the whole stack themselves, because renting the West’s was never an option. It was one of the most thoroughly owned tech ecosystems anywhere. The regime strangled it anyway, which is the part that should stay with you. Ownership ultimately requires a state that lets you keep what you build. The ownership question usually lives in procurement decisions and GPU budgets. In Tehran, it decided the fate of an industry.
