From 577c287aae5c27c7f0a77ec4ce94fd3be1eed57f Mon Sep 17 00:00:00 2001
From: Cossackx <121278003+Cossackx@users.noreply.github.com>
Date: Wed, 13 May 2026 23:03:08 -0400
Subject: [PATCH] docs: add autoresearch operating contract (#435)

Co-authored-by: RAZSOC Local <razsoc@local>
---
 agents/researcher/README.md |   7 +
 docs/swarm/AUTORESEARCH.md  | 254 ++++++++++++++++++++++++++++++++++++
 docs/swarm/README.md        |   6 +-
 docs/swarm/ROLES.md         |   2 +-
 4 files changed, 266 insertions(+), 3 deletions(-)
 create mode 100644 docs/swarm/AUTORESEARCH.md
diff --git a/agents/researcher/README.md b/agents/researcher/README.md
index b16933ca..c3f9b083 100644
--- a/agents/researcher/README.md
+++ b/agents/researcher/README.md
@@ -16,4 +16,11 @@ gbrain
 ## Plugins
 none
 
+## Mode split
+
+- `researcher:quick`: default. Brain-first lookup, external source collection, synthesis, citations, and recommendations.
+- `researcher:autoresearch`: gated optimization loop only. Do not start unless Goal, Scope, Mutable target, Locked eval, Metric, Direction, Verify, Guard, Iterations, Results log, Rollback, and Greenlight boundaries are explicit.
+
+The source-owned operating contract is `docs/swarm/AUTORESEARCH.md`.
+
 This file mirrors `swarm.yaml` and the profile config under `~/.hermes/profiles/researcher/`.
diff --git a/docs/swarm/AUTORESEARCH.md b/docs/swarm/AUTORESEARCH.md
new file mode 100644
index 00000000..96bf6df5
--- /dev/null
+++ b/docs/swarm/AUTORESEARCH.md
@@ -0,0 +1,254 @@
+# Autoresearch Mode
+
+Autoresearch is a bounded optimization harness for Hermes Agents. It is not the default research workflow.
+
+Use it only when the system can mechanically decide whether an iteration improved the metric.
+
+```text
+normal research     = gather evidence -> synthesize -> recommend
+autoresearch mode   = mutate one target -> verify metric -> keep/revert -> repeat
+```
+
+## Source pattern
+
+The useful pattern from Karpathy-style autoresearch and downstream Claude/Codex ports is stable:
+
+1. Lock the scope.
+2. Lock the evaluation surface.
+3. Pick one scalar metric.
+4. Mutate one narrow target.
+5. Run a mechanical verifier.
+6. Keep improvements.
+7. Revert worse/crashing/guard-failing changes.
+8. Log every iteration.
+9. Stop at the configured budget.
+
+If you cannot evaluate it mechanically, do not autoresearch it.
+
+## When to use `researcher:quick`
+
+Use normal researcher mode for:
+
+- source collection from the web, GitHub, X, Reddit, Medium, and YouTube
+- market/model/library scans
+- literature review
+- qualitative synthesis
+- tradeoff notes
+- recommendations where judgment matters
+
+`researcher:quick` may produce an autoresearch config, but it should not start the loop unless every field in the contract below is filled in.
+
+## Autoresearch entry contract
+
+A loop may start only when these fields are explicit:
+
+```yaml
+goal: <one sentence outcome>
+scope: <files/directories/knobs the loop may edit>
+mutable_target: <specific file, skill, prompt, or narrow directory>
+locked_eval: <files/datasets/scoring scripts the loop may not edit>
+metric: <scalar number and unit>
+direction: higher|lower
+verify: <command that emits or lets us parse the metric>
+guard: <command(s) that must keep passing>
+iterations: <bounded count; default pilot is 3-5>
+time_budget: <optional wall-clock cap>
+results_log: autoresearch-results/results.tsv
+rollback: revert worse, crashing, unparsable, or guard-failing changes
+greenlight: required for destructive, public, credential, account, push, deploy, merge, or bulk edits
+```
+
+Do not infer missing fields silently. If a field is unknown, run `autoresearch:plan` / planning mode first.
+
+## Iteration discipline
+
+Each iteration should follow this shape:
+
+```text
+1. Read current state, prior results log, and recent git history.
+2. Pick one small, falsifiable change.
+3. Edit only allowed mutable targets.
+4. Commit or checkpoint the candidate.
+5. Run verify and guard commands.
+6. Parse metric.
+7. If improved and guards pass: keep.
+8. If worse, equal-with-more-complexity, crashed, unparsable, or guards fail: revert.
+9. Append results_log.
+10. Continue until iteration/time budget is exhausted.
+```
+
+Use simplicity as a tie-breaker: equal metric with less code/complexity may be kept; equal metric with more complexity must be reverted.
+
+## Required log shape
+
+Use TSV or JSONL. TSV is the default. In the example below the direction is `lower`, so a negative delta is an improvement:
+
+```tsv
+iteration	commit	metric	delta	status	summary	verify	guard
+0	baseline	42	0	baseline	initial metric	pass	pass
+1	abc123	39	-3	keep	reduced failing lint count in parser	pass	pass
+2	-	45	+6	revert	broadened change broke type guard	pass	fail
+```
+
+Keep failures visible. Reverting a failed experiment is part of the evidence trail, not a problem to hide.
+
+## Role ownership
+
+- `orchestrator`: approves entering autoresearch, locks scope/eval/metric/budget, and decides whether the loop may run in durable/background mode.
+- `researcher:quick`: gathers external/internal evidence and may draft the contract.
+- `researcher:autoresearch`: runs the loop after the contract is complete.
+- `reviewer`: checks kept changes for metric hacking, overfitting, security regressions, and hidden scope expansion.
+- `qa`: replays final verification and any browser/API smoke checks.
+- `km-agent`: promotes durable lessons/results into RAZSOC/GBrain after review.
+
+## Good targets for this stack
+
+### 1. Hermes skill optimization
+
+Improve one skill against fixed prompts and binary rubric checks.
+
+```yaml
+goal: Improve reviewer-core bug catching without increasing false positives.
+scope:
+  - ~/.hermes/skills/**/reviewer-core/SKILL.md
+mutable_target: reviewer-core/SKILL.md
+locked_eval:
+  - evals/reviewer-core/cases/*.md
+  - evals/reviewer-core/rubric.json
+metric: rubric score out of 100
+direction: higher
+verify: python evals/reviewer-core/run_eval.py --json
+guard: hermes chat -Q -t reviewer:gate -q 'load reviewer-core and summarize readiness' | grep -q reviewer
+iterations: 3
+```
+
+### 2. Profile prompt optimization
+
+Tune one profile against fixed briefs.
+
+```yaml
+goal: Make researcher choose GBrain-first lookup reliably before web search.
+scope:
+  - ~/.hermes/profiles/researcher/SOUL.md
+  - ~/.hermes/profiles/researcher/skills/researcher-quick/SKILL.md
+mutable_target: researcher profile guidance
+locked_eval:
+  - evals/researcher-routing/cases.jsonl
+metric: pass rate across routing cases
+direction: higher
+verify: python evals/researcher-routing/run_eval.py
+guard: hermes chat -Q -t researcher:quick -q 'respond with mode readiness only'
+iterations: 3
+```
+
+### 3. GBrain retrieval routing
+
+Optimize route rules/prompts against known-answer fixtures. The corpus and answer key are locked.
+
+```yaml
+goal: Improve citation-correct answers for RAZSOC/GBrain architecture questions.
+scope:
+  - skills/note-taking/gbrain/SKILL.md
+  - profiles/km-agent/SOUL.md
+mutable_target: retrieval/routing guidance only
+locked_eval:
+  - evals/gbrain-routing/questions.jsonl
+  - evals/gbrain-routing/answers.jsonl
+metric: exact-or-cited-correct score
+direction: higher
+verify: python evals/gbrain-routing/run_eval.py --max-cases 12
+guard: gbrain stats >/dev/null
+iterations: 3
+```
+
+### 4. Repo cleanup loop
+
+Reduce one failure class with focused guards.
+
+```yaml
+goal: Reduce no-explicit-any count in changed TypeScript files.
+scope:
+  - src/**/*.ts
+  - src/**/*.tsx
+mutable_target: one module or route family per iteration
+locked_eval:
+  - package.json
+  - eslint config
+metric: eslint no-explicit-any violation count
+direction: lower
+verify: pnpm exec eslint src --format json | python scripts/count-eslint-rule.py @typescript-eslint/no-explicit-any
+guard: pnpm exec vitest run <focused-tests>
+iterations: 5
+```
+
+### 5. Browser/QA harness improvement
+
+Use only deterministic checks.
+
+```yaml
+goal: Increase deterministic /swarm smoke coverage.
+scope:
+  - tests/browser/swarm-smoke.*
+  - src/routes/**/swarm*
+mutable_target: smoke test file first; product code only with explicit approval
+locked_eval:
+  - expected role list
+  - API response assertions
+metric: passing smoke assertions count
+direction: higher
+verify: pnpm exec playwright test tests/browser/swarm-smoke.spec.ts --reporter=json
+guard: pnpm exec vitest run src/server/swarm-health.test.ts
+iterations: 3
+```
+
+## Bad targets / red flags
+
+Do not run autoresearch when:
+
+- the loop can edit the eval, dataset, scorer, or answer key
+- the metric is a proxy that can be gamed easily
+- the desired improvement is mostly taste or strategy
+- the work touches secrets, account settings, public posting, deploys, merges, or destructive cleanup
+- the scope is broad enough to rewrite the vault/repo
+- the verification command is slow, flaky, or manually judged
+- the agent cannot parse the metric deterministically
+
+Common reward-hacking examples:
+
+- deleting hard tests to improve pass rate
+- changing a rubric/answer key instead of behavior
+- caching fixture outputs instead of solving the task
+- suppressing errors instead of fixing causes
+- narrowing search to known examples only
+- adding brittle sleeps/retries to hide flake
+
+## Pilot before background
+
+Default wedge:
+
+1. Run `researcher:quick` to draft the contract.
+2. Run `reviewer` on the contract for metric-hacking risk.
+3. Run `researcher:autoresearch` for 3 iterations, in a foreground or durable session only.
+4. Run `reviewer` on kept diffs.
+5. Run `qa` or focused verification.
+6. Let `km-agent` capture only durable lessons.
+
+Only after a clean pilot should an orchestrator approve a longer or background loop.
+
+## Exit report
+
+Every run must finish with:
+
+```text
+Goal:
+Scope:
+Metric baseline -> final:
+Iterations attempted:
+Kept changes:
+Reverted changes:
+Verification:
+Guard result:
+Reward-hacking review:
+Remaining risks:
+Next recommended loop or stop condition:
+```
diff --git a/docs/swarm/README.md b/docs/swarm/README.md
index fa14e783..e801300d 100644
--- a/docs/swarm/README.md
+++ b/docs/swarm/README.md
@@ -16,6 +16,7 @@ This is not a chat wrapper with tabs. It is the operating surface for a local ag
 
 - [QUICKSTART.md](./QUICKSTART.md) — clone, run, detect profiles, spawn workers, dispatch the first task.
 - [ARCHITECTURE.md](./ARCHITECTURE.md) — loop, SwarmBrief shape, notification routing, lanes, review, repair.
+- [AUTORESEARCH.md](./AUTORESEARCH.md) — bounded optimization-loop contract for `researcher:autoresearch`.
 - [SKILLS.md](./SKILLS.md) — bundled swarm skills, auto-loading, and custom skill conventions.
 - [ROLES.md](./ROLES.md) — role presets used by the Add Swarm dialog and the canonical project specs.
 
@@ -96,8 +97,9 @@ Read these in order if you are testing the v1 release:
 
 1. [QUICKSTART.md](./QUICKSTART.md)
 2. [ARCHITECTURE.md](./ARCHITECTURE.md)
-3. [ROLES.md](./ROLES.md)
-4. [SKILLS.md](./SKILLS.md)
+3. [AUTORESEARCH.md](./AUTORESEARCH.md)
+4. [ROLES.md](./ROLES.md)
+5. [SKILLS.md](./SKILLS.md)
 
 ## Canonical spec
 
diff --git a/docs/swarm/ROLES.md b/docs/swarm/ROLES.md
index 6b5473a3..d276dd05 100644
--- a/docs/swarm/ROLES.md
+++ b/docs/swarm/ROLES.md
@@ -218,7 +218,7 @@ Canonical spec:
 /swarm-specs/projects/swarm4.md
 ```
 
-Sage drafts; humans approve public posting.
+Sage drafts; humans approve public posting. Use normal research for evidence gathering and synthesis. Use autoresearch only for bounded optimization loops with an explicit Goal/Scope/Metric/Verify/Guard/Iterations contract; see [AUTORESEARCH.md](./AUTORESEARCH.md).
 
 ## Scribe