Call My Agent
Meet the most important layer in the stack.
You’ve probably already noticed this: The same AI feels different in different tools — sharper in one app than another, even when you know it’s the same model underneath. There’s a reason.
In the first gap map, I drew the open-source AI stack as a vertical thing: chips at the bottom, applications at the top, every layer scored against its closed-source equivalent. That worked for one shape of system, where your application makes the inference call, formats the result, and hands it to the user. Chat boxes, completion plugins, single-turn summarizers — your application is the runtime; the model is one resource it happens to use. Call that direct inference.
But that’s not the only shape anymore. The thing in your editor that ran tests, found a bug, fixed it, and committed the change isn’t shaped like that. Neither is the briefing that pulled your feeds this morning, picked what mattered, and routed it to you. In those systems, your application doesn’t call the model at all. It hands a goal to an agent. The agent breaks down the work, picks the tools, keeps state across turns, decides when to ask a human, and decides when to stop. The agent is the runtime, while the model is one resource among several. Think agent-mediated.
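If you want the two shapes side by side, here's a minimal sketch in Python. Every name in it (call_model, the action dict, the step budget) is a hypothetical stand-in, not any framework's API:

```python
# Hypothetical stand-ins throughout; swap in your real inference client
# and tools. This is the shape of the two systems, not a framework API.

def call_model(prompt: str) -> dict:
    """Placeholder for any chat-completion call that returns a parsed action."""
    return {"type": "finish", "result": f"(model output for: {prompt[:40]}...)"}

def summarize(document: str) -> str:
    # Direct inference: the application is the runtime.
    # One model call, format the result, hand it to the user.
    return call_model(f"Summarize:\n\n{document}")["result"]

def run_agent(goal: str, tools: dict, max_steps: int = 20) -> str:
    # Agent-mediated: the agent is the runtime. The application hands
    # over a goal; this loop breaks down the work, dispatches tools,
    # keeps state across turns, and decides when to stop.
    history = []
    for _ in range(max_steps):
        action = call_model(f"goal: {goal}\nhistory: {history}")
        if action["type"] == "finish":        # the agent decides when to stop
            return action["result"]
        observation = tools[action["tool"]](**action.get("args", {}))
        history.append((action, observation)) # state lives in the loop, not the app
    return "stopped: step budget exhausted"
```

Same model call in both. In the first, it's the whole program. In the second, it's one line inside something else's control flow.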
These aren’t stages. They’re peers. A summarization tool doesn’t need an agent. A research assistant can’t work without one. Both architectures will stick around.
Hermes Agent, Claude Code, Cursor, Aider, Cline — all take the agent-mediated path. The original map only scored direct inference.
I wrote a white paper sketching how this layer breaks down — runtime plane, control plane, the cross-cutting attributes that Basdevant et al. introduced for foundation models, all applied to the substrate above inference. It’s still in draft. Read it and tell me where I’m wrong.
What the harness is actually doing
Terminal Bench 2.0 is the standard leaderboard for coding agents: 89 real tasks, scored on how many the agent finishes. Same model, Claude Opus 4.6, identical weights: 58.0% in Claude Code, 79.8% in ForgeCode. A 21.8-point spread. A new frontier model release typically buys you 3 to 8 points. This was just a harness swap.
Imagine a singer in a recording booth. The voice hitting the mic is the model. Compression, EQ, reverb, mixing — everything between the mic and your ears is the harness. Hand the same vocal take to two engineers and you get two different songs.
Vivek Trivedy at LangChain has the cleanest framing in Anatomy of an Agent Harness: Agent = Model + Harness. (And if you’re not the model, you’re the harness.) The harness is everything in the running system that isn’t the weights: context management, tool catalogs and dispatch, sandboxes for code execution, memory subsystems for what to remember and forget, orchestration logic for when to call what, hooks for deterministic safety checks. None of that is the model. All of it shows up in the benchmark.
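That parts list translates almost directly into code. Here's a hedged sketch of the skeleton; the field names are mine, not Trivedy's, and not any shipping framework's:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Everything in the running system that isn't the weights.
    Field names are illustrative, not any framework's real API."""
    context_builder: Callable[[str, list], str]  # decides what the model sees
    tools: dict[str, Callable]                   # catalog + dispatch
    memory: list = field(default_factory=list)   # what to remember and forget
    hooks: list = field(default_factory=list)    # deterministic safety checks

    def step(self, task: str, call_model: Callable[[str], dict]) -> dict:
        prompt = self.context_builder(task, self.memory)  # harness chooses context
        action = call_model(prompt)                       # the weights: one call
        for hook in self.hooks:                           # checks run outside
            hook(action)                                  #   the model entirely
        tool = action.get("tool")
        if tool in self.tools:                            # harness dispatches tools
            action["observation"] = self.tools[tool](**action.get("args", {}))
        self.memory.append(action)                        # harness decides memory
        return action
```

One line in that step is the model. Everything else is the harness, and everything else is what the benchmark measured.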
You can reproduce this on your own machine. Point Cline at the same model Claude Code uses — same weights, same API endpoint — and you get different behavior on the same task. Try OpenCode against Cursor on whatever frontier model they share. Same model, different harness… different result. The harness picks what context the model sees, which tools to surface, whether to read three files or thirty before writing one, what to remember between turns, and what to forget. By the time the result reaches you, the harness has shaped most of what you’re looking at. The model is the last step in a chain of decisions the harness has already made.
It happens in things simpler than a coding agent. The morning briefing I built in last week’s DIY Daily post picks which feeds run, what gets summarized, and what gets dropped. Swap the model behind it and the prose reads slightly differently. Swap the harness and you get different feeds, different memory of what I’ve already seen, different rules for what’s worth a line. The briefing is basically unrecognizable.
Last week I pointed at Karpathy’s microGPT, the complete ChatGPT algorithm in 200 lines of code. If the core algorithm fits in 200 lines, then in a working agent the wrapper around the weights does more visible work than the weights themselves.
There’s a real exception. For frontier reasoning, novel domains, and long-horizon work where the agent has to invent its own evaluation as it goes, the model still matters more than the harness. The harness amplifies signal the model has to generate first. For the workloads most teams ship this quarter, the harness is the bigger variable.
The first gap map didn’t have a row for any of it.
The most open part of this story. (It isn’t a product.)
Same rubric as last time: 1 means the open version barely exists, and 5 means closed has nothing on us. Cell-level scores in the spreadsheet. (Same disclaimer: It’s almost certainly wrong in places.)
The headline: Agent-layer enterprise readiness lands at almost exactly the number the first gap map produced for the whole stack. The shape that ate the open-source story ate this layer, too, one level up.
Rows 45 through 50 show the failure.
The components are in good shape. Pick any framework, any open harness, any standard underneath, and you can build with it today. LangGraph and AutoGen on the framework side. Cline and OpenHands on the harness side. Underneath them, MCP wiring the tools and A2A wiring the agents. Quietly, the protocol layer scores higher than anything sitting on top of it. The most open part of this story isn’t a product. It’s the wiring.
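That wiring is easy to demo. Here's a tool server using the official MCP Python SDK; the feed-fetching body is a stub, and the package and module names are as of this writing:

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("feeds")

@mcp.tool()
def fetch_feed(url: str) -> str:
    """Fetch a feed's contents (stubbed here for illustration)."""
    return f"entries from {url}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio; any MCP-capable harness can now call fetch_feed
```

Roughly a dozen lines, and the tool is portable across every harness that speaks the protocol. Nothing above the protocol layer moves that cleanly.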
Then the seams. A task you start in LangGraph cannot be resumed in Semantic Kernel. Each framework keeps its own state its own way, and none of those states travel. You aren’t picking a runtime when you pick an agent framework. You’re picking a one-way door.
The permission model is broken. It’s the lowest-scored row anywhere in this exercise. No shared standard for what an agent is allowed to do, what needs your approval, what’s forbidden, how it gets logged. Every harness solves this differently, and none of those solutions move. Harbor and W3C Verifiable Credentials sketch what a portable answer could look like. But nobody has built it.
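To make the gap concrete, here is the rough shape such a policy could take: three verdicts and an audit log. A hedged sketch; everything below, from the pattern syntax to the field names, is hypothetical, and no harness reads this format today.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from fnmatch import fnmatch

@dataclass
class PermissionPolicy:
    """Hypothetical portable permission policy. Three verdicts, one log.
    Field names and pattern syntax are invented for illustration."""
    allowed: list[str]     # runs without asking, e.g. "fs.read:*"
    forbidden: list[str]   # never runs, e.g. "shell.run:rm *"
    log: list[dict] = field(default_factory=list)

    def check(self, action: str) -> str:
        # Forbidden wins over allowed; anything unmatched needs a human.
        if any(fnmatch(action, p) for p in self.forbidden):
            verdict = "forbidden"
        elif any(fnmatch(action, p) for p in self.allowed):
            verdict = "allowed"
        else:
            verdict = "needs_approval"
        # Every decision is logged, so the audit trail travels with the policy.
        self.log.append({"ts": datetime.now(timezone.utc).isoformat(),
                         "action": action, "verdict": verdict})
        return verdict

policy = PermissionPolicy(allowed=["fs.read:*", "shell.run:pytest*"],
                          forbidden=["shell.run:rm *", "net.post:*"])
assert policy.check("shell.run:pytest tests/") == "allowed"
assert policy.check("shell.run:rm -rf /") == "forbidden"
assert policy.check("fs.write:config.toml") == "needs_approval"
```

The point isn't this particular schema. The point is that a file like this should be something you carry from harness to harness, the way a kubeconfig moves between clusters. Today it can't exist.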
A model migration is an afternoon. A harness migration is a rewrite. Tool integrations, memory schemas, permission decisions, prompts — all coupled to the runtime you picked. Every team picking a harness this quarter is making a multi-year decision and probably treating it like a tooling decision.
The wrapper around the wrapper
In the last post, I made the Linux comparison: In 2002 the kernel was ready, the enterprise wrapper wasn’t. Red Hat, SUSE, and Canonical built the wrapper. Sun didn’t ship in time, and you know how that story ends.
The agent layer is the wrapper around the wrapper. The model and the inference layer underneath it are in good shape in the open. The substrate above them — the part that turns a model from a function call into an agent — is what the community hasn’t named, hasn’t standardized, hasn’t documented, and hasn’t built a permission model for.
Here’s the work that needs to be done:
A portable permission model. The paper alongside this post names this as the largest single openness gap at the agent layer. There is no shared standard. Please build one.
Interoperability between runtimes. The seven core objects in the paper — agent manifest, tool descriptor, task, message, artifact, memory item, run/checkpoint — exist under different names in every implementation. Agents in different frameworks can’t hand off work, query each other’s memory, or coordinate. Those names are a starting point, not a contract — yet.
Agent manifests and tool descriptors with the same rigor model cards have: what each agent claims, requires, and permits; what each tool costs, risks, and returns. A sketch of both follows.
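Here's one hedged guess at the shape, with every field name invented rather than drawn from any existing spec:

```python
from dataclasses import dataclass, field

@dataclass
class ToolDescriptor:
    """What a tool costs, risks, and returns. All fields invented."""
    name: str
    returns: str            # what comes back, e.g. "text/plain"
    cost: str = "free"      # e.g. "$0.002/call"
    risk: str = "read-only" # e.g. "writes-fs", "spends-money"

@dataclass
class AgentManifest:
    """What an agent claims, requires, and permits: a model card
    for the agent layer. All fields invented for illustration."""
    name: str
    claims: list[str]       # capabilities it advertises
    requires: list[str]     # models, tools, runtime features it needs
    permits: list[str]      # actions it may take without approval
    tools: list[ToolDescriptor] = field(default_factory=list)

briefing = AgentManifest(
    name="morning-briefing",
    claims=["summarize-feeds", "drop-already-seen-items"],
    requires=["model:any-chat-completion", "tool:fetch_feed"],
    permits=["fs.read:*"],
    tools=[ToolDescriptor("fetch_feed", returns="text/plain", risk="network-read")],
)
```

Model cards worked because the fields were boring and everyone filled them in anyway. The same discipline, one layer up.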
If the open agent layer spends two more years as a side project, the protocols won’t matter. The harnesses doing the work will be closed. The permission decisions will live in someone else’s config file. The runtime your team picked will be the runtime your team is stuck with. Open at the model layer can’t fix open stopping at the model layer.
…and how you can help
The new spreadsheet is bound to be wrong in places. If you’re building one of these harnesses or evaluating one for a team, tell me where the scores are off: @raffihack on X, in the comments, or DM. I’ll update.
Then find a red row you know how to fix. The permission model, the runtime seams, the manifest layer — that’s where the assignments are. If you’re picking one up, talk to me. I’ll feature what you’re building.
The white paper goes deeper if you want the framework. I’m taking edits.
Let’s ship the wrapper.