LaunchLab Fleet · Swapnil Surdi

The problem

I run a ~22-container homelab on a recycled Windows laptop — photos (~120k assets), media, network-wide ad-blocking, private search, local LLMs, and a self-hosted Matrix server — plus two more recycled laptops in two countries doing household duty. Nothing is public, nothing phones a vendor, and everything runs on hardware I own.

Three machines across time zones is an ops job. I didn’t want Kubernetes, management GUIs, or a pager. I wanted each machine to maintain itself, the fleet to coordinate without me, and a human paged only when something genuinely needs hands. So each laptop is operated by its own headless Claude Code agent; most of the architecture exists to make that safe and affordable.

The architecture

The access plane. Tailscale puts every machine on a private overlay. A self-hosted resolver answers *.surdi.in inside the tailnet; one nginx terminates a wildcard Let’s Encrypt cert (DNS-01) and reverse-proxies by container name. Every service becomes a clean HTTPS URL — status.surdi.in, not an IP and a port — and none of it touches the public internet. Every app port is loopback-bound, nginx is the only tailnet-facing surface, and the Matrix server is network-sealed — no NAT path out at all.

The agents. Each laptop runs its own headless Claude Code process (claude -p): the gatekeeper of exactly one machine — its own — with root on itself and nothing else. Siblings reach each other only over Tailscale SSH: keyless, identity-based, no authorized_keys files. The harness is the agent — no framework on top.

The agent bus is a Matrix room. The agents share a private #ops room on the self-hosted server. The addressing model is the heart of it: in a DM, everything is for the agent; in a shared room, an agent answers only when addressed by name, which is how three agents share one room without all three replying. Hand-offs need no RPC: an agent delegates by posting to #ops addressed to a sibling. The room is the bus. The invoked agent reacts to your message with emoji as a live status line (👀 seen, 👍 working, ✅ done), so replies never waste tokens narrating “on it.”
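A minimal sketch of that addressing rule, under assumptions: the agent names (atlas, borealis) are hypothetical, and "addressed by name" is taken to mean the message opens with the name, optionally prefixed with @ or followed by a colon or comma. The real matching rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    room_is_dm: bool  # True for a one-to-one room with this agent
    body: str         # plain-text message body


def is_addressed_to(agent_name: str, msg: Message) -> bool:
    """Return True if this agent should answer the message."""
    if msg.room_is_dm:
        return True  # in a DM, everything is for the agent
    # Shared room: answer only when the first word is this agent's name.
    # (Assumed convention: "atlas: ...", "@atlas ...", or "atlas, ...".)
    words = msg.body.strip().split(maxsplit=1)
    if not words:
        return False
    first = words[0].lstrip("@").rstrip(":,").lower()
    return first == agent_name.lower()
```

With three agents running the same check against the same room, at most one of them matches any given message, so only one replies.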

No webhooks: each box long-polls Matrix /sync over loopback, journals every message to disk before acting (crash-safe, at-least-once), and runs a write-path canary — added after a real bug kept reads green while every send silently failed for 50 minutes.
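The journal-before-act ordering can be sketched as follows. This is an illustration, not the listener itself: the file layout, the separate "done" file, and the function names are all assumptions. The point is that the event is fsynced to disk before the handler runs, so a crash mid-handler leaves a record that gets replayed. Since a replay can repeat work, the handler has to be safe to run twice.

```python
import json
import os
from pathlib import Path
from typing import Callable


def append_line(path: Path, line: str) -> None:
    """Append one line and fsync it so it survives a crash."""
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def process(journal: Path, done: Path, event: dict,
            handle: Callable[[dict], None]) -> None:
    append_line(journal, json.dumps(event))  # 1. durable record first
    handle(event)                            # 2. act (may crash here)
    append_line(done, event["event_id"])     # 3. mark complete


def pending(journal: Path, done: Path) -> list:
    """Events journaled but never marked done; replay these on startup."""
    done_ids = set(done.read_text(encoding="utf-8").split()) if done.exists() else set()
    if not journal.exists():
        return []
    lines = journal.read_text(encoding="utf-8").splitlines()
    events = [json.loads(line) for line in lines if line]
    return [e for e in events if e["event_id"] not in done_ids]
```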

Telemetry is not chat. Machine data goes to a status hub I wrote in Go on SQLite (WAL): one ~1,400-line file shipped as a pure-Go static binary in a distroless image, dashboard embedded. Every box posts host and per-container telemetry every 60 seconds; identity is derived from the bearer token, never trusted from the body. An incident stream rides on top: every failure lands in a durable per-day JSONL archive before it reaches the database, and is then deduplicated by fingerprint over a 6-hour window, so a once-per-tick failure collapses into one ×N row, not a flood.
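The hub itself is Go; the dedup logic is sketched here in Python purely to show the idea. Assumptions are marked: the fingerprint fields (host, check, detail), the in-memory dict standing in for the database, and the choice to measure the 6-hour window from the row's most recent occurrence rather than its first.

```python
import hashlib
from dataclasses import dataclass

WINDOW_SECONDS = 6 * 60 * 60  # 6-hour dedup window


@dataclass
class IncidentRow:
    fingerprint: str
    first_seen: float
    last_seen: float
    count: int = 1  # rendered as "xN" on the dashboard


def fingerprint(host: str, check: str, detail: str) -> str:
    """Stable short hash of the fields that identify 'the same failure'."""
    raw = "\x1f".join((host, check, detail)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def record(rows: dict, fp: str, now: float) -> IncidentRow:
    """Bump the open row for this fingerprint, or start a new one."""
    row = rows.get(fp)
    if row is not None and now - row.last_seen < WINDOW_SECONDS:
        row.count += 1
        row.last_seen = now
        return row
    # Outside the window (or first occurrence): open a fresh row.
    rows[fp] = IncidentRow(fingerprint=fp, first_seen=now, last_seen=now)
    return rows[fp]
```

A failure that recurs on every 60-second beat for an hour produces one row with a count of 60 rather than 60 rows.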

The watchdog spends zero tokens. Every 5 minutes: 7 deterministic checks — expected containers, container health, key endpoints, DNS, disk, media mounts, chat-listener liveness. 288 runs a day, zero model calls on a healthy day. Alerting is chat-first and latched: 3 pings max per incident, then silence until recovery re-arms it. The model is invoked only when a problem exists and chat delivery provably failed and no email has gone out yet — a doubly-gated escalation.
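The latch and the escalation gate are small enough to sketch directly. Class and function names here are hypothetical; the 3-ping cap, the re-arm on recovery, and the three conditions for invoking the model come from the description above.

```python
from dataclasses import dataclass

MAX_PINGS = 3  # chat pings per incident before going silent


@dataclass
class Latch:
    pings_sent: int = 0

    def on_tick(self, healthy: bool) -> bool:
        """Return True if this watchdog tick should send a chat ping."""
        if healthy:
            self.pings_sent = 0  # recovery re-arms the latch
            return False
        if self.pings_sent >= MAX_PINGS:
            return False         # latched: stay silent until recovery
        self.pings_sent += 1
        return True


def should_invoke_model(problem: bool, chat_delivered: bool,
                        email_sent: bool) -> bool:
    """A model session starts only if a problem exists and both gates fail."""
    return problem and not chat_delivered and not email_sent
```

On a healthy day `on_tick` returns False on every run and `should_invoke_model` is never true, which is where the zero-token figure comes from.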

The loop closes nightly. A 9 PM digest pulls the day’s incidents into one claude -p run that triages each open one and applies a fix only from a strict safe-allowlist: restart an exited container, re-seed an expired credential, clear a stale latch, reload a wedged service. It never edits config, touches data, rotates secrets, or spends money — anything bigger stays open, diagnosed, and flagged. Then it sends me one email. The incidents streamed all day are what the evening pass remediates and reports — fixed before reported.
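One way to express that allowlist is as a default-deny lookup. The action identifiers below are made up for illustration, and how the nightly run names its proposed fixes is an assumption. What matters is that anything not explicitly listed falls through to "leave open and flag".

```python
# The four fix types named above, as hypothetical identifiers.
SAFE_FIXES = frozenset({
    "restart_exited_container",
    "reseed_expired_credential",
    "clear_stale_latch",
    "reload_wedged_service",
})


def triage(proposed_action: str) -> str:
    """Apply only allowlisted fixes; everything else stays open and flagged."""
    if proposed_action in SAFE_FIXES:
        return "apply"
    return "flag_and_leave_open"
```

Because the check is membership in a fixed set, a novel or misspelled action can never be applied by accident.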

Provisioning is one command. New boxes image from a USB stick; a single firstboot script installs the rootless-Podman runtime, hardens it (firewall admitting inbound traffic only on the tailnet interface, password auth off, unattended upgrades), joins the tailnet, and enrolls TPM2 LUKS auto-unlock bound to PCR7. The result is zero-touch boot for a headless appliance, with an accepted trade-off: a stolen powered-on laptop is reachable, while a stolen disk stays encrypted.

Decisions that mattered

The model is the exception, not the loop. The anti-pattern I tore out early: a timer unconditionally piping a prompt into claude -p. The rule that replaced it fronts every scheduled job — cheap deterministic checks every tick, a model session only on a real signal: a message, an anomaly, a due reminder. It’s what makes a 24/7 fleet affordable on a subscription.
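The shape of that rule, as a sketch: a scheduled tick runs its deterministic checks and hands off to a model session only if at least one of them fails. The check names and the `start_session` callback are placeholders; in the real fleet the callback would be whatever launches claude -p.

```python
from typing import Callable, Dict, List


def run_tick(checks: Dict[str, Callable[[], bool]],
             start_session: Callable[[List[str]], None]) -> List[str]:
    """Run cheap checks; start a model session only on a real signal.

    Each check returns True when healthy. Returns the failing check names.
    """
    signals = [name for name, check in checks.items() if not check()]
    if signals:
        start_session(signals)  # the only path that spends tokens
    return signals
```

The anti-pattern is the same function with the `if signals:` guard removed.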

Two channels for two jobs. Telemetry goes to the Go hub; conversation goes to Matrix. Each stays simple by doing one thing.

Bounded, reversible autonomy. Agents run unattended with permission prompts skipped, so every action must be safe to repeat and never destructive — the most-repeated sentence in the codebase. The nightly pass fixes from an allowlist and leaves everything else open: a wrong unattended change at night is worse than an unfixed incident. The hard nevers are what let me sleep through round-the-clock autonomy.

Numbers

The status hub produced the best war story. /api/hosts had quietly degraded to 4.18 s: each host card ran SELECT COUNT(*) FROM containers WHERE report_id = ? against a table grown to ~258k rows (7 days × 60s beats × ~19 containers) with no index on report_id, so every card triggered a full table scan that got slower every day. One index turned each count into a B-tree lookup: 4.18 s → ~18 ms, ~160×. More importantly, it decoupled latency from data volume. The index went live under WAL with zero downtime, then into the embedded schema so rebuilds keep it.
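The effect is easy to reproduce on a toy table with Python's built-in sqlite3 module. The schema, row counts, and index name below are assumptions for the demo, not the hub's actual schema. EXPLAIN QUERY PLAN shows the planner switching from a table scan to an index search once the index exists; the exact wording of the plan text varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE containers (id INTEGER PRIMARY KEY, report_id INTEGER, name TEXT)"
)
# 2,000 reports x 19 containers each: a small stand-in for the real table.
conn.executemany(
    "INSERT INTO containers (report_id, name) VALUES (?, ?)",
    ((r, f"c{c}") for r in range(2000) for c in range(19)),
)

QUERY = "SELECT COUNT(*) FROM containers WHERE report_id = ?"


def plan(db: sqlite3.Connection) -> str:
    """Return the planner's description of how QUERY would be executed."""
    rows = db.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 is the detail text


before = plan(conn)  # a full scan of containers
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_containers_report_id ON containers(report_id)"
)
after = plan(conn)   # a search using idx_containers_report_id

count = conn.execute(QUERY, (42,)).fetchone()[0]
print(before)
print(after)
print(count)
```

CREATE INDEX IF NOT EXISTS is idempotent, so the same statement can be run once against the live database and also kept in the schema that runs on every rebuild.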

The rest: 3 nodes, 22 containers, 60-second telemetry, 7 checks × 288 watchdog runs/day at zero tokens, 6-hour incident dedup, 30-day pruning (JSONL archive never pruned), 3 alert pings max per incident, one email a day.

What I’d steal from this design

A private overlay + wildcard DNS + one reverse proxy turns a pile of containers into port-less HTTPS URLs with zero public exposure.
Make the model the exception, not the loop.
Separate telemetry from conversation.
Let operational scars become architecture — the 160× index, the message journal, and the write-path canary were each born from one specific failure, then generalized so it can’t bite twice.
Bound autonomy explicitly. An allowlist you trust beats a general-purpose agent you don’t.