DIY Daily
How to turn your self-hosted agent into a private news briefing.
It’s 6:30 a.m. My phone buzzes on the kitchen counter. I pick it up and see three short paragraphs, plus five bullets, one possible newsletter angle, and one contrarian take, followed by source links at the bottom — just the way I told my model server I wanted them last week. Read time: 90 seconds, between the first sip of coffee and the second.
The thing that wrote it isn’t on my laptop. My laptop is off. It lives on a small Linux box in a closet, and it’s been awake since I went to bed, reading the news for me on a beat I actually care about. By the time I’m pouring coffee, its reading is done. Some of what it delivers will end up shaping what I write for you: the leads I chase, the links I follow, the threads worth pulling on. That’s the recursion, and it’s part of what this post is about.
The briefing isn’t the only thing the box does. A contractor working on the house sends invoices in whatever format he opens that morning — PDF, scanned photo, sometimes just numbers typed into the body of the email. I connected it to Gmail and a Google Sheet, told it which sender to watch, and asked it to pull line items and append them. The sheet now updates itself. I look at it on Sunday afternoons.
I also keep a running list of restaurants people tell me to try, sorted by city, and the list lives in its memory. When I land somewhere, I ask, and it tells me what I told it last time someone mentioned the place. A hosted assistant could do all of this in theory. None of the ones I use do, because the memory isn’t the product — the chat is.
A month ago, I had you turn your Mac into a model server — llama.cpp. The marginal cost of the next token was zero, assuming you don’t count battery drain or the fact that your thighs were now doing some of the thermal work. Then you closed your laptop, and the model went to sleep with it. Last month we made the model fast. This month we give it somewhere to be when you’re asleep.
Don’t self-host an agent to get cheaper chat. Self-host one to get something that doesn’t sleep when you do. Think of the first useful self-hosted agent not as an AI friend, but as an AI beat reporter. By the end of this post, you’ll have one — even if you skip every command and just read along.
What an agent actually is (and why ChatGPT isn’t quite one)
If you’ve used Cursor or OpenCode, you’ve used an agent — a model in a loop (or, maybe more honestly, a model in a while True: loop with a break condition the model itself decides on). It’s called the ReAct pattern: read the situation, decide what to do, do it (using tools), look at what happened, go again. Coding agents run that loop with a narrow toolbelt: read code, edit code, and run tests. A general-purpose agent runs the same loop with a wider one: search the web, draft an email, and ping you on Telegram when the build breaks.
When you use ChatGPT, you are the agent. You read its answer, decide if it’s right, ask the next question, and paste in the error message yourself. A real agent is the loop. That’s the whole gap between asking ChatGPT to summarize what’s new in local AI every morning and waking up to a briefing already in your phone, generated while you slept.
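That loop is small enough to sketch. Here is a toy version in Python; call_model and run_tool are hypothetical stand-ins for a real model endpoint and toolbelt, not Hermes APIs.

```python
# A toy version of the ReAct loop. call_model and run_tool are
# hypothetical stand-ins for a model endpoint and a toolbelt.
def react_loop(task, call_model, run_tool, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history)          # read the situation, decide
        if step["action"] == "finish":      # the model picks the break condition
            return step["answer"]
        observation = run_tool(step["action"], step["args"])      # do it
        history.append({"role": "tool", "content": observation})  # look, go again
    return None  # budget exhausted: the loop, not the model, enforces the cap
```

Swap the toolbelt and you swap the agent: read/edit/test for a coding agent, search/draft/ping for a general-purpose one.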
A note before we go further: An agent that can act can act wrong. An agent with shell access can rm -rf something it shouldn’t. A wrong answer is a waste of tokens. A wrong action is a waste of trust. An agent reading attacker-controlled web pages can be talked into doing things you never asked for — that’s what people call prompt injection, and it’s part of the threat model the field is still figuring out.
We are at the very beginning of figuring out what permissions for agents should look like: what gets done without asking, what requires confirmation, and what’s flat-out forbidden. You can take a look at my small proposal, called Harbor, which is currently more conversation-starter than product, and still needs work. The rest of this post is, in part, what “be careful” looks like at the level of one practitioner with a laptop: sandbox the agent, narrow what it can touch, lock the front door to one person, and verify the boundaries by hand. From my perspective, the interesting question with agents isn’t “Can it do this?” — it’s “What happens if it does the wrong version of this?” The answer should be: not much.
Some disclosures…and which agent we’re going to use
A few things to put on the table before I tell you to install anything.
First: I’m the CTO of Mozilla, and there are two Mozilla projects already alive in this neighborhood. Thunderbolt, out of MZLA, is the open-source self-hostable AI client for organizations that want a sovereign stack — the on-prem version of this conversation, scaled up. Octonous, out of Mozilla.ai, is the agent product for people who want clear scope and approvals before any workflow. Both deserve their own walkthroughs and will probably get them — just not today.
Second: Mozilla is also an investor in Nous Research, whose agent runtime, Hermes, is what I’m about to recommend. We saw the work first; the check was sent after. Tell me I’m wrong on the merits, not on the affiliation.
Third — and this is the one the comments section is going to want to fight about: OpenClaw. It’s the agent that went viral last winter — Hard Fork did an episode, Anthropic’s lawyers forced a rename, its creator joined OpenAI in February — and it is, fairly, the moment “AI agents” became a category in the public mind.
It’s not where we’re starting today, and there is a security story worth flagging: One of OpenClaw’s own maintainers warned on Discord that the project is “far too dangerous” for users who can’t run a command line, and the foundation is in the middle of figuring out who steers it now that its creator is at OpenAI. But that’s not really why we’re passing on it. The deeper reason is that Hermes is built to become yours. The memory is a file you can cat. The skills are a directory you can ls. The model can be swapped without rewriting anything. The whole arrangement can be picked up off one machine and put down on another. I’m interested in things built for owners, not renters, and Hermes is the project that lets you own this one.
And Hermes is open source — which matters even more for an agent than it does for a model. A model that hallucinates wastes some tokens. An agent that hallucinates takes an action you didn’t ask for, and you find out after. Software that acts on your behalf should be software you can read. Not “audit” in the compliance sense — cat the file and see what it’s about to do. You don’t have to read it. You have to be able to.
What you’re building
Five layers, in this order: the local model endpoint from last issue (or its faster cousin, oMLX, which I’ll explain in a minute), Hermes on top, Telegram as the interface (locked to your numeric user ID and nobody else’s), one recurring job worth doing, and eventually an Ubuntu box that keeps running it after you close the laptop. The first three live on your Mac. The fourth is what makes the fifth worth doing.
The shape of the rest of this post is five steps. Steps 1 through 5 get you a working agent on your Mac, end to end. In a future issue, we’ll do step 6 — the upgrade — moving it onto a server so it keeps working after you close the lid. If you stop after Step 5, you still have something useful.
Step 1: A model endpoint worth pointing an agent at
If you set up llama.cpp last month, you can skip the install, but maybe read the next two paragraphs anyway.
Coding agents — and agent workloads in general — don’t send one prompt and walk away. They send dozens of requests in quick succession, and every request has to ship the entire conversation so far: the system prompt, the tool definitions, the codebase you handed it, every previous turn, plus whatever’s new. Imagine ordering at a restaurant where the waiter has to re-read the entire menu out loud, top to bottom, before you’re allowed to say “and a side of fries.” That’s what the model is doing on every turn.
The thing that normally saves you from this is the KV cache, the model’s running notes on the earlier part of the conversation, kept in memory between calls so it can skip ahead instead of re-reading. But the cache only works if the earlier part — what people call the prefix, basically everything before the latest message — is exactly the same as last time. Add a new file to the context, edit a tool result, change a single token near the top, and the cache is invalidated. The waiter starts the menu over from page one. A few turns in, you’re watching a spinner for 30 to 90 seconds while it re-reads what it already knew. The model used to be fast. Then the agent loop blew through the cache.
oMLX is a native macOS inference server built on MLX, Apple’s own ML framework, and its headline feature is paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix comes back, it’s restored from disk instead of being recomputed — the waiter remembers your menu from yesterday. The project’s own numbers match what I see on an M4 Max: time-to-first-token drops from 30–90 seconds to 1–3 seconds on long contexts in the second-or-later turn of an agent session. That is the difference between “local agent I gave up on” and “local agent I actually use.”
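The prefix rule is mechanical enough to show in a few lines. A toy illustration, not oMLX’s implementation: cached work applies only up to the first token that differs, so an append-only turn is cheap, while one early edit forces nearly everything to be recomputed.

```python
# Toy illustration of prefix caching: the cache covers tokens up to the
# first mismatch; everything after that point must be recomputed.
def tokens_to_recompute(prompt_tokens, cached_prefix):
    matched = 0
    for new, old in zip(prompt_tokens, cached_prefix):
        if new != old:
            break
        matched += 1
    return len(prompt_tokens) - matched  # work left after the cache hit

turn1 = ["sys", "tools", "file_a", "user1"]
turn2 = turn1 + ["assistant1", "user2"]                              # pure append
edited = ["sys", "tools", "file_b", "user1", "assistant1", "user2"]  # early edit
```

The append-only turn pays for two new tokens; the edited turn pays for four of six, even though only one token changed, because everything after the change is downstream of the mismatch.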
Setup is undramatic. Download the DMG from the release page, drag to Applications, set the port to 8000 and the API key to localdev. For the model, take the boring default: mlx-community/Qwen3.5-9B-MLX-4bit. We can optimize or choose something more interesting later.
Step 2: Hermes, and the first boundary
Install Hermes:
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.zshrc
That gets you the hermes command. The next thing is to tell Hermes which model to use. The way you do that is hermes model, which is an interactive walkthrough — not a flag-festival, and not a config file you have to memorize. It opens a menu of providers; pick “Custom endpoint (self-hosted / VLLM / etc.),” and then it asks for four things in turn:
URL:
http://127.0.0.1:8000/v1
API key:
localdev (the placeholder you set in oMLX)
Model:
Qwen3.5-9B-MLX-4bit
Context length:
131072 — half of what the model can natively handle, which is plenty for an agent that hasn’t earned a longer leash yet. (Hermes refuses to start with anything under 64K by design. Agents need working memory.)
Now, before the first prompt, let’s think about the boundary. By default, Hermes will execute its tool calls, which include shell commands, directly on your computer. But, on a laptop full of credentials, source code, and a browser logged into half the internet, that seems crazy. Change the terminal backend so the agent runs its tool calls inside a Docker container instead of directly on your machine. If you don’t already have Docker on the Mac, install Docker Desktop (or brew install --cask docker if you’d rather get there from the terminal). Then:
hermes config set terminal.backend docker
It is a cheap and imperfect boundary, but it is definitely better than no boundary, and it is the right default for a tool you’re learning. You may get annoyed early because the agent can’t see some files you can. That’s not the agent failing; that’s the boundary working.
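If you’re curious what the Docker backend buys you, here is the idea in miniature. This is a hypothetical Python sketch, not Hermes’ actual code; the image name and mount point are illustrative.

```python
import subprocess

# The idea behind terminal.backend=docker, in miniature: each shell tool
# call runs in a throwaway container with no network and exactly one
# host directory visible. Image and mount are illustrative choices.
def sandbox_argv(command, workdir, image="ubuntu:24.04"):
    return ["docker", "run", "--rm",
            "--network", "none",          # no network unless you opt in
            "-v", f"{workdir}:/work",     # one directory, nothing else
            "-w", "/work",
            image, "sh", "-c", command]

def run_in_sandbox(command, workdir):
    # Requires Docker to be installed and running on the host.
    return subprocess.run(sandbox_argv(command, workdir),
                          capture_output=True, text=True, timeout=60)
```

The point of the sketch: the agent’s worst-case blast radius is one directory and one dead container, not your home folder.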
Now run hermes and ask it something easy to verify:
Summarize the README in this directory.
If you see it tool-call its way through reading the file and hand back a summary, you’re past the hardest part. From here, everything else is shaping the system around the conversation it can already have.
Step 3: Telegram, locked to you
You need a way to talk to this thing while you’re out and about, without sitting in front of the terminal. The candidates are many, but for me it came down to Slack, email, and Telegram. Slack was wrong because I didn’t want to wait for an IT admin to approve a new app: that’s the right friction for “production,” and the wrong friction for an evening of tinkering. Email had its own complications. Telegram is, by a wide margin, the fastest path from no bot to a bot replying on your phone. (If you don’t use Telegram day-to-day, that’s fine — you don’t need to. Install the app, talk to one bot, never open it again.)
The mechanics take 15 minutes. Talk to BotFather, get a bot token. Get your own numeric Telegram user ID — @userinfobot will tell you. Then hermes gateway setup walks you through wiring those two values in. The values land in ~/.hermes/.env as:
TELEGRAM_BOT_TOKEN=<your-bot-token>
TELEGRAM_ALLOWED_USERS=<your-numeric-user-id>
That last line is a lot of the protection. Without it, anybody could, theoretically, message your bot — and now a stranger is sending a prompt that runs against your agent with shell access to your container on your laptop. With it, the gateway will refuse messages from anyone whose numeric ID isn’t on the list. (If you ever want to add another person, it’s a comma-separated list. Do not add another person on day one.)
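The check itself is tiny. Here is a hypothetical sketch of what the allowlist amounts to — not Hermes’ actual source, just the same environment variable read the obvious way.

```python
import os

# Hypothetical sketch of the gateway's allowlist check (not Hermes'
# actual source): admit only the numeric IDs listed in the env var.
def is_allowed(sender_id: int) -> bool:
    raw = os.environ.get("TELEGRAM_ALLOWED_USERS", "")
    allowed = {int(x) for x in raw.split(",") if x.strip()}
    return sender_id in allowed  # empty or missing list: nobody gets in
```

Note the failure direction: an empty list means the bot answers no one, which is the right default for software with shell access.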
I deliberately didn’t install Hermes as a launchd service yet. launchd is the macOS process that runs programs in the background and restarts them when they crash, so they keep running even when you’re not looking at them. When you’re tinkering, you don’t want that. I wanted to see the gateway receive messages, reply, and fail in plain sight before I tucked it away into the operating system — which it promptly did: My first hermes gateway run failed with database is locked on the local SQLite file, because another Hermes process was still holding it from earlier. This is the unglamorous truth of self-hosting: a real chunk of the work is debugging operational plumbing — sockets, ports, file permissions, processes that won’t release a file — none of which is slick. Most of the work in self-hosting isn’t AI. It’s port numbers.
I killed it. Restarted. Sent /start from my phone. Got nothing. (Turns out /start isn’t wired up as a special handler.) Sent hi instead. Got a reply.
So now, my Mac was actually serving a model through an agent to a phone, with exactly one human authorized to talk to it. It was also, if I’m honest, the moment the project felt real. Everything before it was set-up. Everything after this is product work.
Step 4: Give it one job
A bot that replies to hi is a demo. Now it needs a job.
The job I gave Hermes is to create the briefing from the opening of this post: scan the web each morning on a beat I care about — local AI models on consumer hardware, self-hosted agents, the open stack — and write it up as three short paragraphs and five bullets, ending with one possible newsletter angle and one contrarian take. Source links at the bottom. Deliver it to me via Telegram by 6:30 a.m.
First, Hermes needs to read the web, which is harder than it sounds, because Google and most search engines actively block automated traffic. Point an agent at google.com and you’ll get CAPTCHAs and rate limits. The workaround is a search/scrape provider that handles the anti-bot work and hands you back clean text. I used Firecrawl, Hermes’ default web backend. Sign up, grab a key, and add it to Hermes’ own environment file (~/.hermes/.env), not your shell’s. The gateway only reads its own .env; dropping FIRECRAWL_API_KEY=... into your terminal session won’t reach the long-running process.
echo 'FIRECRAWL_API_KEY=fc-your-key-here' >> ~/.hermes/.env
Then restart the gateway so the new key is in effect. (I forgot that part the first time and spent 10 minutes baffled by why the agent couldn’t search the web.)
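Why the file and not the shell? Because the long-running gateway parses ~/.hermes/.env itself at startup. A minimal parser sketch makes the distinction concrete — this assumes simple KEY=value lines; the real loader may handle more.

```python
from pathlib import Path

# Minimal .env parser, assuming simple KEY=value lines. The point: a
# daemon reads this file at startup, not your interactive shell's
# environment, so variables exported in a terminal never reach it.
def load_env_file(path):
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```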
A quick note, because this trips people up. There are two surfaces: the Hermes console (the interactive TUI you launch with hermes from a terminal) and Telegram (the chat bot you just wired up). They are not interchangeable. The console is where you teach the agent new things — iterate on prompts, paste in critiques, save skills, watch the tool calls happen in real time. Telegram is the delivery surface — short messages out, short answers back, scheduled briefings showing up in the morning. You could technically teach the agent through Telegram, but you’d hate it. The console is where the work happens; Telegram is where the result shows up.
Here’s the workflow that works: gateway running in one terminal, Hermes console in a second terminal, where I iterate on the briefing prompt by hand until it produces what I want.
Tighter. Cut the editorializing. Lead with the link, not the headline. Stop telling me what’s interesting and just tell me the news.
When the output stops disappointing me, I hand it to Hermes’ built-in scheduler:
hermes cron create "every 1d at 06:30" \
  "Run the daily Owners, Not Renters briefing and deliver to Telegram"
That’s it. No system crontab to edit, no separate scheduler to install. Hermes runs the prompt at your scheduled time every day inside a fresh agent session, and the gateway delivers the result to the platform you configured.
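For a schedule like "every 1d at 06:30", the arithmetic is simple. Here is a sketch of the idea; Hermes’ actual scheduler syntax and internals may differ.

```python
from datetime import datetime, timedelta

# What "every 1d at 06:30" means for scheduling, sketched: the next run
# is today at the target time if that's still ahead, otherwise tomorrow.
def next_daily_run(now: datetime, at: str = "06:30") -> datetime:
    hour, minute = map(int, at.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```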
It’s pretty neat to wake up to a real briefing — three stories, one of them genuinely new to me, sitting at the top of my Telegram before I’d opened my laptop.
Step 5: The part where it starts to become yours
A few days in, I had opinions. The summaries were too broad. I wanted more on self-hosted agents, less on every new model release. I wanted five bullets, not three paragraphs. I wanted a single contrarian takeaway in one sentence at the end, with no preamble. I told Hermes that in the middle of a console session:
Too broad. Focus more on self-hosted agents and local models. Five bullets. End with one newsletter angle and one contrarian takeaway. Remember this format for future briefings.
Then:
Save this process as a skill called owners-briefing.
The next morning’s briefing was in the new shape. The morning after, too. And the morning after that. I had opened a session, complained about the output, and the complaint had stuck.
That’s also the moment to upgrade the cron job. Instead of stuffing the format into the prompt every time, you point cron at the saved skill:
hermes cron edit <job_id> --skill owners-briefing
You can edit the schedule, the prompt, and the format independently. Tomorrow, you might decide the briefing should run twice a day, so you change the schedule. Next week, you might add Slack delivery; that’s a separate flag. And the format you taught lives in the skill, ready to be reused by anything else you want to give it to.
That is the upgrade. It’s the thing that is genuinely hard to get from a hosted assistant whose memory belongs to someone else’s product roadmap.
What you have, and what’s next
What you have at the end of Step 5 is a real thing: a model on your machine, an agent running on top of it, a chat surface locked to one human, and one recurring job that gets better when you teach it. It runs while your laptop is open. It learns when you correct it. It belongs to you, not to a hyperscaler. That is — genuinely — most of the value.
What you don’t have yet is permanence. Close the lid and the briefing stops. Travel and the appointment goes unkept. In a future post, I’ll show you how to move this whole stack onto a small Linux box that stays awake when you don’t, without losing any of the boundaries we just spent a thousand words putting up.
For now: Tell me what you’d give it. What’s the one job that would actually show up in your morning if it just ran — without you remembering to ask, without you opening a tab? Reply, or drop it in the comments. The more concrete jobs I have to work from, the better the next post is going to be.
Two things to take with you
Treat your agent like a new hire. If you hired a personal assistant, you wouldn’t hand them your password manager, your bank login, and the keys to your house on day one. You’d give them your calendar, maybe a corporate card with a low limit, and you’d extend their access as you watched them work. Security folks call this least privilege. It’s the right posture for a software agent, too — and it’s not the default posture you get when you install one. Every step in this post was a chance to give the agent more access than it needed: the Docker boundary, the Telegram allowlist, the model running locally rather than wired to half the internet. The point isn’t that any one of those is doing the security work — it’s that you have to be paranoid at every step. That’s the cost of participating in bleeding-edge stuff: the muscle memory doesn’t exist yet, so you do the thinking out loud, every time, until it does.
Give it one job, not ten. The pitch you’ll hear from most of today’s agent software is that it can do everything: read your email, manage your calendar, post on your behalf, run your day. The temptation on day one is to let it. This is exactly how people end up telling agent horror stories at dinner parties. Pick the smallest workflow that would be useful if it ran perfectly, get it working, watch it for a week, and only then start thinking about a second. The briefing didn’t get good because the agent could do many things; it got good because there was exactly one thing to learn the shape of, and one thing to tune. With seven things going at once, you don’t get to see what your agent actually does — you just get a vague sense of whether the day went okay. With one, you start to know it. One job. One feedback loop. Earn the second job by being unsurprised by the first.



