Closing the Open-Source Gap
I scored 42 subcategories of the open-source AI stack. The models aren't the problem.
Go ahead, open it. It’s a lot of red.
What you're looking at is the open-source AI stack — every layer of it, from hardware and chips up through model weights, datasets, developer tools, documentation, licensing, and safeguards, scored against its closed-source equivalent. The breakdown comes from Unpacking Open Source Artificial Intelligence: Towards a Framework for Openness in Foundation Models. I pointed agents at each one, had them read real repos, compare them to closed-source equivalents, and score across 10 criteria. (If you're curious, hit me up, and maybe I'll package up the code.) If you're actually trying to piece together an open-source AI deployment — stitching inference servers to tooling to compliance — you're running into two problems at once: what's blocking you today, and what nobody's building for tomorrow. This is a map of both.
The headline: Average maturity is 3.2 out of 5. Enterprise readiness averages 2.3.
Works in demos. Dies in procurement. Maybe that’s why every engineering team that would prefer to own their inference is defaulting to closed APIs: Not because the models are worse, but because nobody’s staking a production SLA on a GitHub repo with no support contract. The preference exists. The plumbing doesn’t.
Only one subcategory hits a 4 on enterprise readiness: ML frameworks — PyTorch, TensorFlow, JAX — the things that have had a decade of production hardening. Everything else? No uptime guarantees. No VPC isolation. No AD integration. Closed platforms don’t win on the model. They win on the contract.
The developer experience is the other gap. Integrating the OpenAI API involves five-ish lines of code. Deploying the open-source equivalent? Docker, reverse proxies, manual model management. Teams with a six-week ship date, of course, are going to pick the thing that works today. And every team that makes that call — reasonably, defensibly — translates to another year of lock-in that gets harder to unwind later. (And more expensive when the pricing changes, as 135,000 OpenClaw users just learned.)
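To make the contrast concrete, here is a minimal sketch of the "five-ish lines": the same OpenAI-style request can target either the closed API or a self-hosted server that speaks the OpenAI-compatible protocol (vLLM exposes one by default). The port and model name below are assumptions for illustration, not a recommendation.

```python
import json

# The "five-ish lines": build an OpenAI-style chat-completions request.
# Only the base URL and model name change between the closed API and a
# self-hosted, OpenAI-compatible server.
CLOSED_BASE = "https://api.openai.com/v1"      # managed, SLA, invoice
SELF_HOSTED_BASE = "http://localhost:8000/v1"  # vLLM's default port (assumption)

def chat_request(base_url: str, model: str, prompt: str) -> tuple[str, str]:
    """Return (url, body) for an OpenAI-style chat completion."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

# Same client code either way; only the deployment behind it differs.
url, body = chat_request(SELF_HOSTED_BASE, "meta-llama/Llama-3.1-8B-Instruct", "hi")
```

The catch the paragraph describes is everything behind `SELF_HOSTED_BASE`: the Docker image, the reverse proxy, and the model management that the closed endpoint quietly handles for you.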
Licensing scored the lowest in the entire spreadsheet. There are no templates that address model output ownership, training data rights, or prompt injection liability. Every AI startup either adopts a closed platform’s terms wholesale or pays $50K+ for custom legal drafting. That’s not a technical gap. It’s an institutional vacuum.
Of course, this spreadsheet is almost certainly wrong in places. If you’re deep in any of these layers and you see something I missed, tell me: @raffihack on X, or comment below.
The Bright Spots Are Real
Deployment, inference code, base weights, and training code all scored 4 on overall maturity. vLLM, llama.cpp, and HuggingFace TGI are production-grade today. Strong tools, no wrapper.
The models aren’t the problem.
Open-weight models now trail the closed frontier by roughly three months. That’s Epoch AI’s Capabilities Index talking — a composite score across 37 benchmarks, updated continuously, built in collaboration with Google DeepMind. On MMLU specifically, the gap collapsed from 17.5 percentage points to under 1 in about a year. DeepSeek-R1 matched o1-class reasoning at a reported training cost of $6M. Qwen3.5 scores 88.4 on GPQA Diamond, ahead of everything except the most expensive closed options. On a single gaming GPU, you can run open-weight models matching frontier performance from nine months ago.
Three months behind on capabilities. And closing.
Even Mythos — the model Anthropic said was too dangerous to ship — isn’t an argument against open-source models being smart enough. AISLE researchers replicated several of its headline vulnerability findings with openly available models. Alex Stamos — Stanford’s former internet security chief, ex-CISO at Facebook and Yahoo — estimates six months before open-weight models close the rest of that gap.
But three months isn’t a law of physics. It’s a snapshot. Nathan Lambert makes a fair counterpoint: as frontier labs push into domains that aren’t on the public web — proprietary environments, long-horizon agentic work, specialized RL pipelines — the gap could widen again.
Which means two things need to be true at once. We need people working on models — the work DeepSeek, Qwen, and Meta are doing matters in a world where every month of lag is a month where someone signs a closed contract instead. But we also need people investing in the rest of the stack, because even when the models are competitive, teams are still defaulting to closed platforms. Not because the model is worse. Because the model is the only part that’s ready.
Deloitte’s 2025 survey found that nearly 60% of respondents named legacy integration and risk/compliance — not model quality — as their top barriers to deploying agentic AI. The 2026 follow-up (3,235 leaders, 24 countries) put insufficient worker skills at the top of the list, with legacy data and infrastructure right behind. Nobody in procurement is blocking on benchmarks.
The capability gap is a “months” problem. The packaging gap is a “years” problem. And right now, almost all the energy is going to the part that’s already closest to parity.
Enterprises are signing multi-year contracts, building integrations, and training teams on specific APIs. Every quarter that open source doesn’t have a credible enterprise story, those defaults harden. Procurement doesn’t re-evaluate because a benchmark improved. It re-evaluates when something is obviously easier, obviously cheaper, and obviously supported.
We’ve seen this movie before. In 2002, Linux had the kernel but not the enterprise story: no certifications, no vendor support, no one willing to put it in the data center with a pager attached. Red Hat, SUSE, and Canonical didn’t build a better kernel. They built the wrapper. They made it adoptable. That’s what turned an ideology into an industry. Ask Sun how not shipping the wrapper in time worked out.
The open-source AI stack is at that same inflection point. What’s missing is the boring stuff: the SLAs, the compliance tooling, the five-line integration. That’s not a research problem. It’s an engineering and institutional problem. And those are the kinds of problems this community has solved before.
We’re not losing on the model. We’re losing on the paperwork.
Look at the gap map. Find a red zone you know how to fix. Maybe that’s compliance tooling. Maybe it’s a five-line SDK. Maybe it’s the ToS template that doesn’t exist yet. Maybe you’re training the next open-weight model that keeps the three-month number from slipping. All of it counts.
If you’re building in any part of this stack — models, infrastructure, tooling, licensing, documentation — I want to hear from you. Reply here, DM me, yell at @raffihack. I’ll feature what you’re working on. Let’s close this gap.
What I’m Reading This Week
Anthropic built a model that found zero-days in every major OS and browser — a 27-year-old OpenBSD bug that survived five million automated scans — and then didn’t ship it. Instead: Project Glasswing. $100M in credits to a consortium (AWS, Apple, Microsoft, Cisco, Linux Foundation) to harden infrastructure before Mythos-class capabilities proliferate. The technical writeup is worth your time. Notable absence from the consortium: the US government, which banned Claude from federal agencies. I've been thinking a lot about what this means for the people who weren't on the partner list.
Full disclosure: Nous Research is a Mozilla VC company, but I would highlight them anyway. Hermes Agent is their answer to the question nobody at the closed platforms wants you to ask: What if the agent itself were open? Think OpenClaw, but with autonomous skill creation, cross-session memory, six terminal backends (including serverless that hibernates when idle), and an RL pipeline for fine-tuning tool-calling models on your own trajectories. It swaps providers with zero code changes. MIT licensed, runs on a $5 VPS. The agent-as-a-service pricing model looks a lot less inevitable when the alternative is curl | bash.
Chardet, the Python character-encoding library with hundreds of millions of annual downloads, got a Claude Code-powered ground-up rewrite and a license change from LGPL to MIT. Mark Pilgrim, the original author who deleted his entire online presence in 2011, came back specifically to file the issue: exposure to the original codebase means no clean-room defense, and an AI rewrite doesn’t change that. The implications go way past one library: if a court ever rules that AI output is a derivative work of its training corpus, a lot of commercial codebases are carrying copyleft obligations nobody’s accounted for.
Alibaba AI is truly unhinged. (Via Ben Dickson at TechTalks.)
Andrej Karpathy’s — former Tesla AI lead, OpenAI founding team — latest obsession: using LLMs to compile personal knowledge bases. Dump source documents into a raw/ directory, have a model incrementally build a wiki of interlinked markdown files with summaries, backlinks, and concept articles. No database, no custom tooling — just .md and .png files with an AGENTS.md schema. I’m going to bumble my way through building one and write up what actually works.
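To make the "no database, no custom tooling" claim concrete, here is a hedged sketch of one piece such a wiki needs: a backlink pass over plain markdown. The link syntax and page names are my assumptions for illustration, not Karpathy's actual schema.

```python
import re
from collections import defaultdict

# Hedged sketch of a backlink pass for a flat-file wiki: no database,
# just markdown pages that link to each other with [text](page.md).
LINK = re.compile(r"\[[^\]]*\]\(([\w\-/]+\.md)\)")

def backlinks(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each page to the list of pages that link to it."""
    incoming = defaultdict(list)
    for name, text in pages.items():
        for target in LINK.findall(text):
            incoming[target].append(name)
    return dict(incoming)

wiki = {
    "transformers.md": "See [attention](attention.md).",
    "attention.md": "Used by [transformers](transformers.md).",
}
# backlinks(wiki) -> {"attention.md": ["transformers.md"],
#                     "transformers.md": ["attention.md"]}
```

The appeal of the flat-file approach is exactly this: every "feature" of the wiki is a small, inspectable pass over text files the model can regenerate at will.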
Carnegie Mellon is releasing a paper on the problem everyone building with coding agents is about to hit: What happens when you need multiple agents working on the same codebase simultaneously? Their system (CAID) uses git worktrees for isolation, git merge for integration, and dependency graphs for task ordering — the same primitives human dev teams already rely on. Key finding: giving a single agent more iterations doesn’t help, but splitting work across coordinated agents does (+14% on library-from-scratch tasks, +27% on paper reproduction). I want to try implementing this.
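The coordination primitives are standard enough to sketch. This is my illustration of the idea, not CAID's actual API: one isolated worktree per task, and a dependency graph that decides which tasks an agent may start, which Python's standard-library `graphlib` handles directly.

```python
from graphlib import TopologicalSorter

def schedule(deps: dict[str, set[str]]) -> list[str]:
    """Order tasks so every task comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

def worktree_cmd(task: str) -> list[str]:
    # One isolated checkout per agent, merged back later with plain
    # `git merge` (paths and branch naming are assumptions).
    return ["git", "worktree", "add", f"../wt-{task}", "-b", f"agent/{task}"]

# Hypothetical task graph: ui depends on api, api depends on schema.
tasks = {"api": {"schema"}, "ui": {"api"}, "schema": set()}
order = schedule(tasks)  # schema before api before ui
```

The point of reusing git's primitives is that the merge, conflict, and history machinery human teams already trust does the integration work; the agents only need the ordering.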
My good friend Ryan Sarver wrote this — over a million views and counting. He built an AI chief of staff on OpenClaw out of markdown files, Python scripts, and a $20/month subscription. No SaaS, no vendor, no contract. Flat files he owns, backed up to git. It tracks his fundraise pipeline, preps him before every meeting, extracts action items after, and improves itself every week. The title says “better than any human I’ve hired.” Ryan hired me. I’m choosing not to take that personally. But this is the quote that matters: “Open source is an unmatched engine for turning community ideas and energy into progress. A week from idea to full native integration. Closed source is fast, but open source is something else entirely.” I’m building my own version and will write it up for you soon.
Button is a $180 AI pin from ex-Apple Vision Pro engineers that BLE-tethers to your iPhone and proxies every utterance to an unspecified cloud LLM. It might run something on-device. Nobody knows, including, apparently, the press. Eight bucks a month for unspecified Pro features. No Android. I keep waiting for AI hardware to be more than Siri in a lanyard, and it keeps shipping as a $180 curl to someone else’s API. My friend Ayah Bdeir nailed the structural reason in MIT Tech Review: when the hardware layer is closed, every device converges on the same interaction model — press, talk, wait for cloud. You can’t iterate on form factor if you can’t touch the board. She’s now CEO of Current AI, which just demoed an open-source handheld at the India AI Summit — local inference, no cloud round-trip, 22 languages via Bhashini, full schematics going to GitHub. We talk about open source like it’s a software problem. The hardware layer is just as locked, and it’s why every AI device on the market feels exactly the same.
Anthropic told every tool built on top of Claude: Use our API and pay per token or lose access. Starting April 4, subscribers can no longer route their subscription through third-party agent harnesses like OpenClaw. The stated reason is capacity: A single OpenClaw user consumes 6-8x the resources of a human subscriber, and subscriptions weren’t built for agentic workloads. The unstated context? OpenClaw’s creator had just joined OpenAI, and Anthropic had just shipped competing features into Claude Code. If you needed a concrete example of what platform dependency risk looks like in the AI stack, this is it. 135,000 active instances woke up one Saturday to find their cost structure had changed overnight. Some users are reporting 50x increases. Others are switching to local models.
Google Research open-sourced TimesFM, a foundation model for time-series forecasting — not “what word comes next?” but “what number comes next?”, applied to server load, revenue, error rates, stock prices, energy consumption. 200M parameters, 16k context length, PyTorch or JAX, Apache 2.0. Most teams doing this work are still hand-rolling statistical models or paying for a proprietary API.
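For context on what "hand-rolling statistical models" means in practice, here is the kind of baseline a foundation model like TimesFM competes against: a seasonal-naive forecast that just repeats the last observed season. The data and season length are made up for illustration.

```python
# A hand-rolled seasonal-naive baseline: forecast the next `horizon`
# points by repeating the last full season of observations. This is the
# usual sanity check for periodic series like server load or revenue.
def seasonal_naive(history: list[float], season: int, horizon: int) -> list[float]:
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]

# Two "days" of three-bucket load data (illustrative numbers).
load = [100, 120, 90, 101, 119, 91]
forecast = seasonal_naive(load, season=3, horizon=3)  # [101, 119, 91]
```

Beating this baseline reliably, across domains, without per-series tuning is exactly the pitch of a pretrained forecasting model.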
Finally, file under DO NOT GET ME STARTED…: The White House proposed slashing NSF’s budget by 55% to $4 billion while claiming it will “maintain funding” for AI and quantum research. Read the fine print: basic AI research at NSF would be cut 32%, basic quantum 37%. The “maintained” funding goes to applied research at Defense and Energy. So the plan is to defund the pipeline that produces the science and then act surprised when there’s nothing left to apply. If you’re wondering why ownership of the AI stack matters, this is the policy environment you’re building in: The public funding for foundational research is being gutted while the administration tells you everything is fine because DARPA got a raise.



