CASE STUDY · THREE OF THE MOST-USED OPEN-SOURCE AI AGENTS

Three popular agents.
One thin harness, three ways.

Q: Which of the three has the best harness?

None covers more than 2 of 11 canonical harness layers at full strength. Hermes leads on Memory and Learning (persistent cross-session memory, self-improving skills). Kilo Code leads on Execution (git-based checkpoints, tool-call reliability plumbing). OpenClaw leads on reach (messaging apps, device control), but its broad, weakly-gated action surface lines up with its documented security incidents — an RCE CVE and a marketplace incident involving over a thousand malicious skills.

Hermes Agent, Kilo Code, and OpenClaw all run the same commodity ReAct loop — each project's own documentation says so. Mapped against the eleven-layer harness architecture, here's what each one actually built around that loop, and what all three still leave thin. Read the loop-first version on Design a Loop →

8/11 harness layers where none of the three agents reaches full coverage

3 layers where any agent reaches full coverage at all — Execution (Kilo), Memory and Learning (both Hermes)

0/3 ship a formal output/reviewer-pass quality gate before a reply goes out

See the harness ↓ See the loop ↓ Sources ↓

AT A GLANCE

Three agents, three very different scales.

Each project reports its own headline numbers in its own units — stars, users, tokens — so the cards below show each on its own terms.

Hermes Agent

MakerNous Research

LaunchedFeb 2026

LicenseMIT

GitHub stars~140,000 (in 3 mo.)

OpenRouter volume770B tokens/day

Primary interfaceCLI · messaging · IDE · API

Kilo Code

MakerKilo (fork of Roo/Cline)

LaunchedMar 2025

LicenseOpen source + hosted gateway

Users1.5M+ (3M+ "Kilo Coders")

Tokens processed40T+ (cumulative)

OpenRouter volume235B tokens/day

Primary interfaceVS Code · JetBrains · CLI · mobile

OpenClaw

MakerPeter Steinberger

LaunchedNov 2025 (as Clawdbot)

LicenseMIT

GitHub stars~247,000

OpenRouter volume161B tokens/day

Primary interfaceMessaging apps · device nodes

OpenRouter daily token volume — global app rankings

Hermes Agent

770B/day

Kilo Code

235B/day

OpenClaw

161B/day

Live figures from OpenRouter's global app rankings (openrouter.ai/apps), accessed July 2026 — all three now report through OpenRouter, correcting an earlier assumption on this page that Kilo Code was gateway-only. Hermes overtook OpenClaw on this metric back in mid-May 2026 and has grown substantially since. For context, Claude Code (not one of the three agents covered here) ranks between Kilo Code and OpenClaw at 218B/day.

THE HARNESS

Mapped against the eleven-layer harness.

Each agent's documented behavior mapped against the same eleven canonical layers used across Build A Harness — not an ad hoc feature list.

● present — a real, documented mechanism ◐ partial — an ad hoc or static substitute ○ not described in public docs

Harness layer	Hermes Agent	Kilo Code	OpenClaw
1. Caller State	○Not described	◐Agent "modes" (code/ask/plan/debug) fix tools + instructions per task	◐Messaging channel/persona selects context — not a formal constraint system
2. World Model	○Raw history only	◐Static Memory Bank files	◐Static persona/bootstrap files
3. Reasoning	○Not described	○Not described	○Not described
4. Control State	○Not described	◐Binary approval gate, not a tiered resolver	○Queueing prevents races, isn't a risk control
5. Planning	◐Subagents get their own capped budget via delegate_task — no full dependency graph	◐Subagent delegation for isolated subtasks	○Not described
6. Execution	◐70+ tools, ~28 toolsets, 6 backends, concurrent dispatch	●Git-snapshot checkpoints + diff-repair + tool-call translation	◐Broadest action surface (device/OS tools), weakly gated
7. Verification	○Not described	◐Informal completion check + diff-repair	○Not described
8. Recovery	◐Provider fallback, fixed iteration budget	◐Checkpoint rollback, diff-repair	◐Auto-compaction retry only
9. Memory	●SQLite+FTS5, context compression with a 20-message floor	◐Memory Bank, context condensing	◐Raw JSONL transcript, token-limit + compaction reserve
10. Learning	●Self-improving skills — the project's headline differentiator	○Not described	○Explicitly static — no self-improvement
11. Output & Reviewer Pass	○Not described	○Not described — human review substitutes	○Reply shaping is cosmetic, not a quality gate

Reading the pattern: Execution, Memory, and Learning are the only layers where any agent reaches full coverage — Kilo's checkpoints, and Hermes's persistent memory and self-improving skills, respectively. Reasoning and Output & Reviewer Pass get zero coverage across all three. The rest — Caller State, World Model, Control State, Planning, Verification, and Recovery — are partial at best everywhere: never fully absent, never fully built out either. Which is a fair description of where most production agents are today, popular or not.

What it's actually like to run one.

Beyond the layer-by-layer mapping, each project reads differently in practice:

Practical dimension	Hermes Agent	Kilo Code	OpenClaw
Model / provider breadth	18+ providers, open-weight models	500+ models, 60+ providers via the Kilo Gateway	Claude, GPT, DeepSeek, or local — OAuth piggyback on existing subscriptions
Primary reach	CLI, 20 messaging platforms, IDE (ACP), API, cron	VS Code, JetBrains, CLI, cloud agent, mobile	WhatsApp, Telegram, Discord, Signal, iMessage, Slack, Teams, native device apps
Business model	Free, MIT; runs on a few-dollar-a-month VPS	Free/BYOK or Kilo credits, zero markup on provider rates; $8M seed round	Free, MIT, self-hosted; foundation stewardship after the founder joined OpenAI
Known risk	Newest of the three (Feb 2026) — memory and self-improving skills lack a long track record	Parsing reliability is a constant fight across a 500-model matrix — not a novel loop, an inherited one	RCE CVE (CVSS 8.8), 1,000+ malicious marketplace skills reported, prompt-injection susceptible; restricted for state use in China

In Build A Harness

All 11 layers ship as drawable, composable canvas nodes — Reasoning, Control State, Verification, and Output & Reviewer Pass included, the four layers thinnest across all three agents above. See the full architecture →

THE LOOP

Same commodity loop, underneath all three.

Every layer above wraps around a loop that, on its own, isn't a differentiator. Each project's own documentation says as much — and mapped against the same seven-stage loop, the pattern is clear.

● present ◐ partial ○ not described

Loop stage	Hermes Agent	Kilo Code	OpenClaw
1. Perceive	○Raw session history, no typed beliefs	◐Static Memory Bank files	◐Static bootstrap/persona files
2. Reason	○Not described	○Not described	○Not described
3. Decide	○Provider fallback only	◐Binary human-approval gate	○Queueing, not risk-tiering
4. Act	◐Concurrent tool dispatch, no risk review	●Git checkpoint before every edit	◐Broad device/OS action surface
5. Verify	○Not described	◐Informal check + diff-repair	○Cosmetic reply shaping only
6. Recover	◐Fallback provider, iteration budget	◐Checkpoint rollback, diff-repair	◐Auto-compaction retry only
7. Learn & Repeat	●Persistent memory + self-improving skills	◐Memory Bank persists across sessions	○Static — explicitly no self-improvement

See each agent's full, unsimplified step list

Hermes Agent — 8 steps

Generate a task ID, append the user message
Build or reuse the cached system prompt
Check whether preflight compression is needed (>50% context full)
Build API messages per API mode
Inject ephemeral prompt layers (budget/context warnings)
Apply prompt caching markers (Anthropic)
Make an interruptible API call
Parse response: execute tool calls and loop to step 4, or persist + return

Kilo Code — 7 steps

User gives a task, selects an agent mode (code/ask/plan/debug)
Kilo assembles context: system prompt, auto-found files, @-mentions, Memory Bank
Prompt + tools sent to the configured model via the Kilo Gateway
Model replies with reasoning plus typically one tool call
State-changing actions pause for user approval unless auto-approved
Approved action executes; result captured
If complete, present for review; else append observation and loop to step 3

OpenClaw — 7 steps

Intake — Gateway RPC/CLI validates params, returns {runId, acceptedAt} immediately
Queue & resolve — serialize via per-session + global queues, resolve model/auth
Session & workspace prep — resolve workspace, load skills, inject bootstrap files
Prompt assembly — base + skills + bootstrap context, token limits enforced
Model inference ⇄ tool execution — stream text/tool-call deltas, execute, repeat
Reply shaping — assemble final payload, filter internal tokens
Persist & emit — write transcript to JSONL, emit lifecycle:end

Hermes Agent docs

"On its own, this loop is a fairly conventional ReAct-style tool-calling loop — structurally similar to what Claude Code, OpenClaw, and most other agent harnesses do..."

Kilo Code docs

"On its own, this is a conventional ReAct-style tool-calling loop — structurally the same shape used by Cline, Roo Code, Claude Code, Cursor, and most other agent harnesses."

OpenClaw docs

"On its own, this is a conventional ReAct-style tool-calling loop... The loop mechanics are not OpenClaw's differentiator."

FAQ

Questions, answered.

Why compare Hermes Agent, Kilo Code, and OpenClaw specifically?

They're three of the most-used open-source AI agents running today, each with public architecture documentation detailed enough to map against a common harness taxonomy — and each one explicitly describes its own loop as conventional in its own docs.

Which of the three has the best harness?

None covers more than 2 of 11 canonical layers at full strength. Hermes leads on Memory and Learning. Kilo Code leads on Execution. OpenClaw leads on reach, but its broad, weakly-gated action surface lines up with its documented security incidents — an RCE CVE and a marketplace incident involving over a thousand malicious skills.

What is the 11-layer harness architecture?

The canonical architecture used across Build A Harness: Caller State, World Model, Reasoning, Control State, Planning, Execution, Verification, Recovery, Memory, Learning, and Output & Reviewer Pass. Full definition →

Where does Build A Harness fit in?

Build A Harness implements all 11 layers as drawable, composable canvas nodes — so you can build the layers these three agents leave thin (Reasoning, Control State, Verification, Output & Reviewer Pass) without writing that infrastructure yourself.

SOURCES

Primary documentation.

This comparison draws on each project's own architecture docs and independent reporting, current as of mid-2026. Token-volume figures are live from OpenRouter's global app rankings, accessed July 2026 — those numbers move daily and will drift from this snapshot.

Hermes Agent

Kilo Code

Using Agents — Kilo Code docs
Checkpoints — Kilo Code docs
Inside Kilo Code — Tessl
Kilo's own OpenClaw-vs-Hermes page (vendor comparison — corroborating color only)

OpenClaw

The layer-by-layer coverage mapping above is our own reading of each project's public documentation against Build A Harness's canonical taxonomy, not an official audit or a claim endorsed by any of the three projects.

BUILD A HARNESS · v0.8.0 · PUBLIC ALPHA

Build the layers
these three leave thin.

Reasoning, Control State, Verification, and Output & Reviewer Pass — the layers where all three of the most-used open-source agents are thinnest — ship as drawable canvas nodes in Build A Harness. Apache 2.0. Runs locally via Docker.

Star on GitHub Explore Build A Harness → Read the loop-first version →

Three popular agents.One thin harness, three ways.

Three agents, three very different scales.

Mapped against the eleven-layer harness.

What it's actually like to run one.

Same commodity loop, underneath all three.

Questions, answered.

Why compare Hermes Agent, Kilo Code, and OpenClaw specifically?

Which of the three has the best harness?

What is the 11-layer harness architecture?

Where does Build A Harness fit in?

Primary documentation.

Build the layersthese three leave thin.

Three popular agents.
One thin harness, three ways.

Build the layers
these three leave thin.