Build A Harness GitHub
CASE STUDY · THREE OF THE MOST-USED OPEN-SOURCE AI AGENTS

Three popular agents.
One thin harness, three ways.

Hermes Agent, Kilo Code, and OpenClaw all run the same commodity ReAct loop — each project's own documentation says so. Mapped against the eleven-layer harness architecture, here's what each one actually built around that loop, and what all three still leave thin. Read the loop-first version on Design a Loop →

8/11 harness layers where none of the three agents reaches full coverage
3 layers where any agent reaches full coverage at all — Execution (Kilo), Memory and Learning (both Hermes)
0/3 ship a formal output/reviewer-pass quality gate before a reply goes out

Three agents, three very different scales.

Each project reports its own headline numbers in its own units — stars, users, tokens — so the cards below show each on its own terms.

Hermes Agent
MakerNous Research
LaunchedFeb 2026
LicenseMIT
GitHub stars~140,000 (in 3 mo.)
OpenRouter volume770B tokens/day
Primary interfaceCLI · messaging · IDE · API
Kilo Code
MakerKilo (fork of Roo/Cline)
LaunchedMar 2025
LicenseOpen source + hosted gateway
Users1.5M+ (3M+ "Kilo Coders")
Tokens processed40T+ (cumulative)
OpenRouter volume235B tokens/day
Primary interfaceVS Code · JetBrains · CLI · mobile
OpenClaw
MakerPeter Steinberger
LaunchedNov 2025 (as Clawdbot)
LicenseMIT
GitHub stars~247,000
OpenRouter volume161B tokens/day
Primary interfaceMessaging apps · device nodes
OpenRouter daily token volume — global app rankings
Hermes Agent
770B/day
Kilo Code
235B/day
OpenClaw
161B/day
Live figures from OpenRouter's global app rankings (openrouter.ai/apps), accessed July 2026 — all three now report through OpenRouter, correcting an earlier assumption on this page that Kilo Code was gateway-only. Hermes overtook OpenClaw on this metric back in mid-May 2026 and has grown substantially since. For context, Claude Code (not one of the three agents covered here) ranks between Kilo Code and OpenClaw at 218B/day.

Mapped against the eleven-layer harness.

Each agent's documented behavior mapped against the same eleven canonical layers used across Build A Harness — not an ad hoc feature list.

present — a real, documented mechanism partial — an ad hoc or static substitute not described in public docs
Harness layer Hermes Agent Kilo Code OpenClaw
1. Caller State Not described Agent "modes" (code/ask/plan/debug) fix tools + instructions per task Messaging channel/persona selects context — not a formal constraint system
2. World Model Raw history only Static Memory Bank files Static persona/bootstrap files
3. Reasoning Not described Not described Not described
4. Control State Not described Binary approval gate, not a tiered resolver Queueing prevents races, isn't a risk control
5. Planning Subagents get their own capped budget via delegate_task — no full dependency graph Subagent delegation for isolated subtasks Not described
6. Execution 70+ tools, ~28 toolsets, 6 backends, concurrent dispatch Git-snapshot checkpoints + diff-repair + tool-call translation Broadest action surface (device/OS tools), weakly gated
7. Verification Not described Informal completion check + diff-repair Not described
8. Recovery Provider fallback, fixed iteration budget Checkpoint rollback, diff-repair Auto-compaction retry only
9. Memory SQLite+FTS5, context compression with a 20-message floor Memory Bank, context condensing Raw JSONL transcript, token-limit + compaction reserve
10. Learning Self-improving skills — the project's headline differentiator Not described Explicitly static — no self-improvement
11. Output & Reviewer Pass Not described Not described — human review substitutes Reply shaping is cosmetic, not a quality gate

Reading the pattern: Execution, Memory, and Learning are the only layers where any agent reaches full coverage — Kilo's checkpoints, and Hermes's persistent memory and self-improving skills, respectively. Reasoning and Output & Reviewer Pass get zero coverage across all three. The rest — Caller State, World Model, Control State, Planning, Verification, and Recovery — are partial at best everywhere: never fully absent, never fully built out either. Which is a fair description of where most production agents are today, popular or not.

What it's actually like to run one.

Beyond the layer-by-layer mapping, each project reads differently in practice:

Practical dimension Hermes Agent Kilo Code OpenClaw
Model / provider breadth 18+ providers, open-weight models 500+ models, 60+ providers via the Kilo Gateway Claude, GPT, DeepSeek, or local — OAuth piggyback on existing subscriptions
Primary reach CLI, 20 messaging platforms, IDE (ACP), API, cron VS Code, JetBrains, CLI, cloud agent, mobile WhatsApp, Telegram, Discord, Signal, iMessage, Slack, Teams, native device apps
Business model Free, MIT; runs on a few-dollar-a-month VPS Free/BYOK or Kilo credits, zero markup on provider rates; $8M seed round Free, MIT, self-hosted; foundation stewardship after the founder joined OpenAI
Known risk Newest of the three (Feb 2026) — memory and self-improving skills lack a long track record Parsing reliability is a constant fight across a 500-model matrix — not a novel loop, an inherited one RCE CVE (CVSS 8.8), 1,000+ malicious marketplace skills reported, prompt-injection susceptible; restricted for state use in China
In Build A Harness

All 11 layers ship as drawable, composable canvas nodes — Reasoning, Control State, Verification, and Output & Reviewer Pass included, the four layers thinnest across all three agents above. See the full architecture →

Same commodity loop, underneath all three.

Every layer above wraps around a loop that, on its own, isn't a differentiator. Each project's own documentation says as much — and mapped against the same seven-stage loop, the pattern is clear.

present partial not described
Loop stage Hermes Agent Kilo Code OpenClaw
1. Perceive Raw session history, no typed beliefs Static Memory Bank files Static bootstrap/persona files
2. Reason Not described Not described Not described
3. Decide Provider fallback only Binary human-approval gate Queueing, not risk-tiering
4. Act Concurrent tool dispatch, no risk review Git checkpoint before every edit Broad device/OS action surface
5. Verify Not described Informal check + diff-repair Cosmetic reply shaping only
6. Recover Fallback provider, iteration budget Checkpoint rollback, diff-repair Auto-compaction retry only
7. Learn & Repeat Persistent memory + self-improving skills Memory Bank persists across sessions Static — explicitly no self-improvement
See each agent's full, unsimplified step list
Hermes Agent — 8 steps
  1. Generate a task ID, append the user message
  2. Build or reuse the cached system prompt
  3. Check whether preflight compression is needed (>50% context full)
  4. Build API messages per API mode
  5. Inject ephemeral prompt layers (budget/context warnings)
  6. Apply prompt caching markers (Anthropic)
  7. Make an interruptible API call
  8. Parse response: execute tool calls and loop to step 4, or persist + return
Kilo Code — 7 steps
  1. User gives a task, selects an agent mode (code/ask/plan/debug)
  2. Kilo assembles context: system prompt, auto-found files, @-mentions, Memory Bank
  3. Prompt + tools sent to the configured model via the Kilo Gateway
  4. Model replies with reasoning plus typically one tool call
  5. State-changing actions pause for user approval unless auto-approved
  6. Approved action executes; result captured
  7. If complete, present for review; else append observation and loop to step 3
OpenClaw — 7 steps
  1. Intake — Gateway RPC/CLI validates params, returns {runId, acceptedAt} immediately
  2. Queue & resolve — serialize via per-session + global queues, resolve model/auth
  3. Session & workspace prep — resolve workspace, load skills, inject bootstrap files
  4. Prompt assembly — base + skills + bootstrap context, token limits enforced
  5. Model inference ⇄ tool execution — stream text/tool-call deltas, execute, repeat
  6. Reply shaping — assemble final payload, filter internal tokens
  7. Persist & emit — write transcript to JSONL, emit lifecycle:end
Hermes Agent docs

"On its own, this loop is a fairly conventional ReAct-style tool-calling loop — structurally similar to what Claude Code, OpenClaw, and most other agent harnesses do..."

Kilo Code docs

"On its own, this is a conventional ReAct-style tool-calling loop — structurally the same shape used by Cline, Roo Code, Claude Code, Cursor, and most other agent harnesses."

OpenClaw docs

"On its own, this is a conventional ReAct-style tool-calling loop... The loop mechanics are not OpenClaw's differentiator."

Questions, answered.

Why compare Hermes Agent, Kilo Code, and OpenClaw specifically?

They're three of the most-used open-source AI agents running today, each with public architecture documentation detailed enough to map against a common harness taxonomy — and each one explicitly describes its own loop as conventional in its own docs.

Which of the three has the best harness?

None covers more than 2 of 11 canonical layers at full strength. Hermes leads on Memory and Learning. Kilo Code leads on Execution. OpenClaw leads on reach, but its broad, weakly-gated action surface lines up with its documented security incidents — an RCE CVE and a marketplace incident involving over a thousand malicious skills.

What is the 11-layer harness architecture?

The canonical architecture used across Build A Harness: Caller State, World Model, Reasoning, Control State, Planning, Execution, Verification, Recovery, Memory, Learning, and Output & Reviewer Pass. Full definition →

Where does Build A Harness fit in?

Build A Harness implements all 11 layers as drawable, composable canvas nodes — so you can build the layers these three agents leave thin (Reasoning, Control State, Verification, Output & Reviewer Pass) without writing that infrastructure yourself.

Primary documentation.

This comparison draws on each project's own architecture docs and independent reporting, current as of mid-2026. Token-volume figures are live from OpenRouter's global app rankings, accessed July 2026 — those numbers move daily and will drift from this snapshot.

The layer-by-layer coverage mapping above is our own reading of each project's public documentation against Build A Harness's canonical taxonomy, not an official audit or a claim endorsed by any of the three projects.

Build the layers
these three leave thin.

Reasoning, Control State, Verification, and Output & Reviewer Pass — the layers where all three of the most-used open-source agents are thinnest — ship as drawable canvas nodes in Build A Harness. Apache 2.0. Runs locally via Docker.