The agent doesn't know your codebase. That's your problem to solve.
Three sessions in, it breaks a contract it read last week. You explain why. It works. A week later, different contract, same outcome. You're not debugging the agent — you're debugging the absence of memory.
This isn't a model quality problem. The agent starts every session knowing nothing about your system: how it's organized, what it must never touch, which invariants took you months to arrive at. You've been solving that by re-explaining. That doesn't scale — not at the session velocity you're running, not if you want it to work through tasks without hand-holding.
The fix isn't a longer prompt. It's a harness — a system of structured documents the agent loads before it touches anything, maintains as it works, and gates on before it ships.
The architecture behind this — the three-tier memory model, why the always-loaded layer needs to stay small, what happens when specs go stale — is laid out in an earlier piece. This is what building it actually looks like: a specific harness, constructed across 335 tasks on a real production codebase, with the files and rules that make it hold.
Three kinds of memory
The agent has three kinds of context to work with. Getting the architecture right matters more than getting any individual document right.
Hot memory loads every session, unconditionally. This is your constitution. It states what the project is, what commands to run, what conventions are non-negotiable, and where to look when the task touches something specific. Keep it short. Everything that can be deferred into the second or third tier should be.
Warm memory loads when the task touches relevant territory. Subsystem specs, task backlogs, test documents, data model contracts. The agent doesn't need all of it every session. It needs to know where to get it and when.
Cold memory loads on escalation. The full product spec. The full data model. The canonical source documents that answer questions the hot and warm tiers can't. Most tasks never touch this layer. The ones that do need it completely.
The mistake almost everyone makes: putting everything in hot memory. The constitution gets long. Long constitutions get ignored — not because agents are lazy, but because relevant signal gets buried in noise. Ruthless deferral is the discipline.
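Concretely, the tiers map to files. In the harness this piece walks through, it looks roughly like this (your paths will differ; the shape is what transfers):

hot  → AGENTS.md (the constitution, loaded every session)
warm → docs/spec/bootstrap.md, docs/ops/tasks.md, docs/tests-*.md
cold → spec.md, data-model.md (the canonical sources)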
Write the constitution first
Your constitution lives at AGENTS.md (Codex) or CLAUDE.md (Claude Code). Both tools load it automatically at session start.
Keep it short. Ruthlessly short.
It should contain exactly four things:
- What the project is and does.
- The commands to install, test, lint, and run.
- The conventions that must never be broken.
- A routing table: file globs mapped to the documents the agent should read first.

For example, from this harness's constitution:
## Routing triggers
| When files match | Read first |
|---------------------------------|-------------------------------------|
| `runtime/go/internal/auth/**` | `docs/spec/bootstrap.md` §auth |
| `runtime/go/cmd/daemon/**` | `docs/spec/bootstrap.md` + spec |
| `docs/tests-daemon.md` | `scripts/run_tests_daemon.sh` |
| `docs/tests-cli.md` | `scripts/run_tests_cli.sh` |
The routing table is what makes a constitution useful instead of decorative. Without it, the agent has to guess what to read. With it, the changed files tell the agent exactly where to go. If your constitution doesn't have one, it has rules with no navigation.
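Put together, the whole file can stay under a screen. A skeleton, not this project's actual constitution:

## What this is

One paragraph: the project, in plain words.

## Commands

Install, test, lint, run: one line each.

## Non-negotiables

- All SQLite writes go through the single-writer queue.
- A test doc and its runner script change in the same commit.

## Routing triggers

The glob-to-doc table shown above.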
Load the minimum. Escalate when necessary.
The agent shouldn't load everything. It should load the minimum needed for this task, with a clear escalation path when that's not enough.
A load order that works:
1. docs/context/index.yaml — always
2. docs/spec/bootstrap.md — always
3. docs/ops/tasks.md — always
4. UI files (spec-ui, tests-ui) — only if task touches UI
5. spec.md + data-model.md — only if contracts are changing
The key discipline is diff-first loading: before reading anything, run git diff --name-only and load only the files that were touched plus their directly related contracts. Escalate to the full spec only when scoped sources can't answer the question.
A task that renames a flag in the CLI doesn't need the full data model spec. One that changes a WebSocket event envelope does. The distinction matters — loading everything means no task gets the context it actually needs.
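That discipline is simple enough to sketch. A minimal version, assuming the routing table above is mirrored in the script (a real one would parse it out of the constitution instead of hardcoding the pairs):

```bash
#!/usr/bin/env bash
# Sketch: diff-first loading. List the changed files, match them against
# the constitution's routing globs, print what to read before anything else.
# The glob-to-doc pairs are copied from the routing table above.
set -euo pipefail

declare -A routes=(
  ["runtime/go/internal/auth/**"]="docs/spec/bootstrap.md §auth"
  ["runtime/go/cmd/daemon/**"]="docs/spec/bootstrap.md + spec"
  ["docs/tests-daemon.md"]="scripts/run_tests_daemon.sh"
  ["docs/tests-cli.md"]="scripts/run_tests_cli.sh"
)

git diff --name-only | while read -r file; do
  for glob in "${!routes[@]}"; do
    # [[ ]] pattern matching lets * cross slashes, so ** collapses to *.
    if [[ $file == ${glob//\*\*/\*} ]]; then
      echo "$file -> read first: ${routes[$glob]}"
    fi
  done
done
```

Anything the script doesn't match falls through to the default load order; anything it does match names its own required reading.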
Give every task its own memory
Your task backlog tells the agent what to do. A plan file tells it how to approach this specific task — what the goal is, what the acceptance criteria are, which files are in scope, what not to break.
Keep active plans in docs/plans/active/. When a task completes, move the plan to docs/plans/completed/ with a concrete list of what changed. Add a sibling context file in docs/plans/context/. This is the task's running memory — the decisions made, the dead ends hit, the invariants checked. Not a status report. What the agent would need to read if it resumed this task cold.
docs/plans/
├─ active/
│  └─ T-336-add-model-timeout.md
├─ completed/
│  └─ T-335-daemon-disconnect-fix.md
└─ context/
   └─ T-335-daemon-disconnect-root-cause-fix.md
The plan file scopes the task. The context file is the running memory. Together they let a task survive session boundaries without losing state — which means you can stop mid-task, come back, and pick up exactly where you left off.
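The bookkeeping is mechanical enough to script. A minimal sketch, with a hypothetical helper name, that assumes the plan and context files share the task's name (the tree above shows the context file can carry a longer one):

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive a finished plan and make sure its running
# memory exists. Usage: scripts/archive_plan.sh T-335-daemon-disconnect-fix
set -euo pipefail
task="$1"
mv "docs/plans/active/${task}.md" "docs/plans/completed/${task}.md"

ctx="docs/plans/context/${task}.md"
# Seed the context file if the task never wrote one. It should hold the
# decisions, dead ends, and checked invariants, not a status report.
[ -f "$ctx" ] || printf '# %s: running memory\n' "$task" > "$ctx"
```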
Couple docs to code as an invariant
This is the rule that prevents drift.
If a test document changes, the matching runner script must change in the same commit. No exceptions.
docs/tests-daemon.md → scripts/run_tests_daemon.sh
docs/tests-cli.md → scripts/run_tests_cli.sh
docs/tests-ui.md → scripts/run_tests_ui.sh
The test documents are the authoritative description of what the system should do. The runner scripts are what actually validates it. Let them drift apart and you end up with docs that describe behavior the tests don't check, or tests that check behavior the docs don't mention. The agent will confidently follow whichever one it loaded last.
Write this into the constitution explicitly. The agent follows rules it's given. Give it the rule, and the coupling holds.
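The rule is also cheap to enforce mechanically. A sketch of a pre-commit check, assuming the doc-to-script pairs above:

```bash
#!/usr/bin/env bash
# Sketch: fail the commit when a test doc changes without its runner.
# The pairs mirror the mapping above; extend the list as docs are added.
set -euo pipefail
changed=$(git diff --cached --name-only)

require_pair() {
  local doc="$1" runner="$2"
  if grep -qxF "$doc" <<<"$changed" && ! grep -qxF "$runner" <<<"$changed"; then
    echo "coupling violation: $doc changed but $runner did not" >&2
    exit 1
  fi
}

require_pair docs/tests-daemon.md scripts/run_tests_daemon.sh
require_pair docs/tests-cli.md    scripts/run_tests_cli.sh
require_pair docs/tests-ui.md     scripts/run_tests_ui.sh
```

Wire it in as a pre-commit hook or a CI step, and the invariant stops depending on anyone remembering it.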
Gate every task on a harness check
Before any task can be marked done, it must pass check_harness.sh.
This script isn't a test runner. It's an architecture gate — a separate layer that checks things tests can't easily verify. Does every HTTP route in the code appear in the OpenAPI spec? Does any package write to SQLite directly, bypassing the single-writer queue? Do all required files exist? Does go vet pass cleanly?
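What that looks like varies by project. A heavily simplified sketch of those four checks; every path and pattern in it is an assumption, and a real gate would be stricter:

```bash
#!/usr/bin/env bash
# Sketch of an architecture gate. The checks mirror the ones described
# above; the paths and grep patterns are assumptions, not the real script.
set -euo pipefail
fail() { echo "harness: $*" >&2; exit 1; }

# Required harness files exist.
for f in AGENTS.md docs/spec/bootstrap.md docs/ops/tasks.md; do
  [ -f "$f" ] || fail "missing required file: $f"
done

# No direct SQLite access outside the single-writer package
# (assumed to live at runtime/go/internal/store/).
if grep -rn --include='*.go' 'sql.Open("sqlite' runtime/go \
     | grep -v 'runtime/go/internal/store/'; then
  fail "SQLite write path outside the single-writer queue"
fi

# Every HTTP route registered in code appears in the OpenAPI spec
# (the route-extraction pattern is illustrative, not general).
(grep -rhoE --include='*.go' '"/api/[^"]+"' runtime/go || true) \
  | tr -d '"' | sort -u | while read -r route; do
      grep -qF "$route" docs/spec/openapi.yaml \
        || fail "route $route missing from OpenAPI spec"
    done

# Static analysis must pass cleanly.
(cd runtime/go && go vet ./...) || fail "go vet failed"

echo "harness: OK"
```

The specific checks matter less than the shape: one command, binary outcome, runnable by the agent and by CI alike.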
The completion checklist in the constitution makes this explicit:
## Completion Checklist
Before marking a task done:
1. Acceptance criteria are satisfied.
2. Task removed from active backlog, added to completed archive.
3. Plan file moved from active/ to completed/.
4. Manual test steps updated in the relevant tests doc.
5. Matching runner script updated if test steps changed.
6. scripts/check_harness.sh passes.
This is what turns "code written" into "done." Without the checklist, tasks end when the agent thinks the work is complete. With it, tasks end when the harness agrees.
Define stop conditions — and only those
The biggest failure mode for autonomous agents isn't bad code. It's stopping too often.
The agent asks "should I proceed?" when it should just proceed. It pauses at task boundaries to report progress when it should move to the next task. Every unnecessary stop is a hand-holding session you didn't ask for.
Define valid stop conditions explicitly in the constitution:
## Continuous execution
After completing a task, immediately move to the next highest-priority
unblocked task without waiting for an additional instruction.
Valid stop conditions:
- Missing product decision the task depends on
- Required secrets or credentials not available
- Unresolvable failure that cannot be handled locally
Do not stop at task boundaries just to report progress.
The agent will follow this. A session that starts on T-336 works through T-337 and T-338 on its own, stopping only when something genuinely blocks progress. That's the session you wanted.
The harness doesn't make the agent smarter. It makes the context reliable and the rules explicit.
The agent already knows how to code. What it lacks is the project memory you've built up over months of working in the system.
Write that memory down. Structure it for machines, not humans. Keep the always-loaded layer small. Build the escalation path. Add the gate.
Then let it run.