docs: add autoresearch operating contract (#435)

Co-authored-by: RAZSOC Local <razsoc@local>
2026-05-13 23:03:08 -04:00
parent f5fc172cc0
commit 577c287aae
4 changed files with 266 additions and 3 deletions
--- a/agents/researcher/README.md
+++ b/agents/researcher/README.md
@@ -16,4 +16,11 @@ gbrain
 ## Plugins
 none
 ## Mode split
 - `researcher:quick`: default. Brain-first lookup, external source collection, synthesis, citations, and recommendations.
 - `researcher:autoresearch`: gated optimization loop only. Do not start unless Goal, Scope, Mutable target, Locked eval, Metric, Direction, Verify, Guard, Iterations, Results log, Rollback, and Greenlight boundaries are explicit.
 The source-owned operating contract is `docs/swarm/AUTORESEARCH.md`.
 This file mirrors `swarm.yaml` and the profile config under `~/.hermes/profiles/researcher/`.
--- a/docs/swarm/AUTORESEARCH.md
+++ b/docs/swarm/AUTORESEARCH.md
@@ -0,0 +1,254 @@
 # Autoresearch Mode
 Autoresearch is a bounded optimization harness for Hermes Agents. It is not the default research workflow.
 Use it only when the system can mechanically decide whether an iteration improved the metric.
 ```text
 normal research     = gather evidence -> synthesize -> recommend
 autoresearch mode   = mutate one target -> verify metric -> keep/revert -> repeat
 ```
 ## Source pattern
 The useful pattern from Karpathy-style autoresearch and downstream Claude/Codex ports is stable:
 1. Lock the scope.
 2. Lock the evaluation surface.
 3. Pick one scalar metric.
 4. Mutate one narrow target.
 5. Run a mechanical verifier.
 6. Keep improvements.
 7. Revert worse/crashing/guard-failing changes.
 8. Log every iteration.
 9. Stop at the configured budget.
 If you cannot evaluate it mechanically, do not autoresearch it.
 ## When to use `researcher:quick`
 Use normal researcher mode for:
 - source collection from the web, GitHub, X, Reddit, Medium, and YouTube
 - market/model/library scans
 - literature review
 - qualitative synthesis
 - tradeoff notes
 - recommendations where judgment matters
 `researcher:quick` may produce an autoresearch config, but it should not start the loop unless the contract below is filled in.
 ## Autoresearch entry contract
 A loop may start only when these fields are explicit:
 ```yaml
 goal: <one sentence outcome>
 scope: <files/directories/knobs the loop may edit>
 mutable_target: <specific file, skill, prompt, or narrow directory>
 locked_eval: <files/datasets/scoring scripts the loop may not edit>
 metric: <scalar number and unit>
 direction: higher|lower
 verify: <command that emits or lets us parse the metric>
 guard: <command(s) that must keep passing>
 iterations: <bounded count; default pilot is 3-5>
 time_budget: <optional wall-clock cap>
 results_log: autoresearch-results/results.tsv
 rollback: revert worse, crashing, unparsable, or guard-failing changes
 greenlight: required for destructive, public, credential, account, push, deploy, merge, or bulk edits
 ```
 Do not infer missing fields silently. If a field is unknown, run `autoresearch:plan` / planning mode first.
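The "do not infer missing fields" rule can be enforced mechanically before a loop starts. The following is a minimal sketch, assuming the contract has already been parsed into a Python dict; `is_unset` and `contract_problems` are hypothetical helper names, and treating any `<...>` string as an unfilled placeholder is an assumption, not part of the contract.

```python
# Hypothetical pre-flight check. Field names come from the contract above;
# the function names and the placeholder convention are illustrative assumptions.
REQUIRED_FIELDS = (
    "goal", "scope", "mutable_target", "locked_eval", "metric", "direction",
    "verify", "guard", "iterations", "results_log", "rollback", "greenlight",
)  # time_budget is optional in the contract, so it is not listed here


def is_unset(value) -> bool:
    """True for None, empty values, and untouched <placeholder> strings."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip()
        return not text or (text.startswith("<") and text.endswith(">"))
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def contract_problems(contract: dict) -> list:
    """Return readable problems; an empty list means every field is explicit."""
    problems = [
        f"{name}: missing or placeholder"
        for name in REQUIRED_FIELDS
        if is_unset(contract.get(name))
    ]
    direction = contract.get("direction")
    if not is_unset(direction) and direction not in ("higher", "lower"):
        problems.append("direction: must be 'higher' or 'lower'")
    iterations = contract.get("iterations")
    if not is_unset(iterations) and not (
        isinstance(iterations, int)
        and not isinstance(iterations, bool)
        and iterations > 0
    ):
        problems.append("iterations: must be a positive integer")
    return problems
```

If the returned list is non-empty, the intended next step is planning mode rather than a silent default.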
 ## Iteration discipline
 Each iteration should follow this shape:
 ```text
 1. Read current state, prior results log, and recent git history.
 2. Pick one small, falsifiable change.
 3. Edit only allowed mutable targets.
 4. Commit or checkpoint the candidate.
 5. Run verify and guard commands.
 6. Parse metric.
 7. If improved and guards pass: keep.
 8. If worse, equal-with-more-complexity, crashed, or guards fail: revert.
 9. Append results_log.
 10. Continue until iteration/time budget is exhausted.
 ```
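Step 6 only works if parsing is deterministic. Here is a sketch of a strict parser, under the assumption that the verify command prints the metric as a bare number on its last non-empty line; that output convention and the name `parse_metric` are illustrative, not something the contract prescribes. Anything else is reported as unparsable so the iteration can be reverted.

```python
import re
from typing import Optional

# Plain decimal or scientific notation only; "nan" and "inf" are rejected on purpose.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_metric(stdout: str) -> Optional[float]:
    """Return the metric if the last non-empty line is exactly one number, else None."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        return None  # the verify command printed nothing
    if _NUMBER.fullmatch(lines[-1]) is None:
        return None  # unparsable output counts as a failed iteration
    return float(lines[-1])
```

Returning `None` instead of guessing keeps a crashed or chatty verifier from being mistaken for a result.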
 Use simplicity as a tie-breaker: equal metric with less code/complexity may be kept; equal metric with more complexity must be reverted.
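The keep/revert rule, including the simplicity tie-breaker, can be written as one small function. This is a sketch under stated assumptions: `complexity_delta` is a hypothetical measure (for example, net lines added), and reverting an equal metric with unchanged complexity is a conservative choice made here, since the rule above only covers the "less" and "more" cases.

```python
from typing import Optional


def decide(
    best: float,
    candidate: Optional[float],
    direction: str,
    guards_passed: bool,
    complexity_delta: int = 0,
) -> str:
    """Return 'keep' or 'revert' for one iteration.

    candidate is None when the run crashed or the metric could not be parsed.
    complexity_delta is negative when the change made the target simpler.
    """
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    if candidate is None or not guards_passed:
        return "revert"
    improved = candidate > best if direction == "higher" else candidate < best
    if improved:
        return "keep"
    if candidate == best and complexity_delta < 0:
        return "keep"  # same metric, simpler change
    return "revert"  # worse, or equal without a simplicity gain
```

For example, with `direction="lower"`, a candidate of 39 against a best of 42 is kept, while a candidate of 45 with a failing guard is reverted.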
 ## Required log shape
 Use TSV or JSONL; TSV is the default:
 ```tsv
 iteration	commit	metric	delta	status	summary	verify	guard
 0	baseline	42	0	baseline	initial metric	pass	pass
 1	abc123	39	-3	keep	reduced failing lint count in parser	pass	pass
 2	-	45	+6	revert	broadened change broke type guard	pass	fail
 ```
 Keep failures visible. Reverting a failed experiment is part of the evidence trail, not a problem to hide.
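One way to keep the log append-only is a small writer that adds the header once and then one row per iteration, kept or reverted. This is a minimal sketch using Python's standard `csv` module; `append_result` is a hypothetical helper name, and the column list is copied from the TSV example above.

```python
import csv
from pathlib import Path

COLUMNS = ["iteration", "commit", "metric", "delta", "status", "summary", "verify", "guard"]


def append_result(log_path, row: dict) -> None:
    """Append one iteration row, writing the header first if the log is new or empty."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if write_header:
            writer.writeheader()
        # Unknown keys raise ValueError, so a misspelled column fails loudly.
        writer.writerow(row)
```

Opening the file in append mode means earlier rows, including reverts, are never rewritten.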
 ## Role ownership
 - `orchestrator`: approves entering autoresearch, locks scope/eval/metric/budget, and decides whether the loop may run in durable/background mode.
 - `researcher:quick`: gathers external/internal evidence and may draft the contract.
 - `researcher:autoresearch`: runs the loop after the contract is complete.
 - `reviewer`: checks kept changes for metric hacking, overfitting, security regressions, and hidden scope expansion.
 - `qa`: replays final verification and any browser/API smoke tests.
 - `km-agent`: promotes durable lessons/results into RAZSOC/GBrain after review.
 ## Good targets for this stack
 ### 1. Hermes skill optimization
 Improve one skill against fixed prompts and binary rubric checks.
 ```yaml
 goal: Improve reviewer-core bug catching without increasing false positives.
 scope:
  - /home/aleks/.hermes/skills/**/reviewer-core/SKILL.md
 mutable_target: reviewer-core/SKILL.md
 locked_eval:
  - evals/reviewer-core/cases/*.md
  - evals/reviewer-core/rubric.json
 metric: rubric score out of 100
 direction: higher
 verify: python evals/reviewer-core/run_eval.py --json
 guard: hermes chat -Q -t reviewer:gate -q 'load reviewer-core and summarize readiness' | grep -q reviewer
 iterations: 3
 ```
 ### 2. Profile prompt optimization
 Tune one profile against fixed briefs.
 ```yaml
 goal: Make researcher choose GBrain-first lookup reliably before web search.
 scope:
  - /home/aleks/.hermes/profiles/researcher/SOUL.md
  - /home/aleks/.hermes/profiles/researcher/skills/researcher-quick/SKILL.md
 mutable_target: researcher profile guidance
 locked_eval:
  - evals/researcher-routing/cases.jsonl
 metric: pass rate across routing cases
 direction: higher
 verify: python evals/researcher-routing/run_eval.py
 guard: hermes chat -Q -t researcher:quick -q 'respond with mode readiness only'
 iterations: 3
 ```
 ### 3. GBrain retrieval routing
 Optimize route rules/prompts against known-answer fixtures. The corpus and answer key are locked.
 ```yaml
 goal: Improve citation-correct answers for RAZSOC/GBrain architecture questions.
 scope:
  - skills/note-taking/gbrain/SKILL.md
  - profiles/km-agent/SOUL.md
 mutable_target: retrieval/routing guidance only
 locked_eval:
  - evals/gbrain-routing/questions.jsonl
  - evals/gbrain-routing/answers.jsonl
 metric: exact-or-cited-correct score
 direction: higher
 verify: python evals/gbrain-routing/run_eval.py --max-cases 12
 guard: gbrain stats >/dev/null
 iterations: 3
 ```
 ### 4. Repo cleanup loop
 Reduce one failure class with focused guards.
 ```yaml
 goal: Reduce no-explicit-any count in changed TypeScript files.
 scope:
  - src/**/*.ts
  - src/**/*.tsx
 mutable_target: one module or route family per iteration
 locked_eval:
  - package.json
  - eslint config
 metric: eslint no-explicit-any violation count
 direction: lower
 verify: pnpm exec eslint src --format json | python scripts/count-eslint-rule.py @typescript-eslint/no-explicit-any
 guard: pnpm exec vitest run <focused-tests>
 iterations: 5
 ```
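The verify command above pipes ESLint's JSON report into `scripts/count-eslint-rule.py`, which this document does not show. The following is a hypothetical sketch of what such a script could contain, assuming the standard ESLint JSON formatter output (a list of per-file results, each with a `messages` list whose entries carry a `ruleId`). The function names are illustrative.

```python
import json
import sys


def count_rule(report: list, rule_id: str) -> int:
    """Count messages in an ESLint JSON report whose ruleId equals rule_id."""
    return sum(
        1
        for file_result in report
        for message in file_result.get("messages", [])
        if message.get("ruleId") == rule_id
    )


def main(argv, stdin=sys.stdin, stdout=sys.stdout) -> int:
    """Read a report from stdin and print the count as a single number."""
    if len(argv) != 2:
        print("usage: count-eslint-rule.py <rule-id>", file=sys.stderr)
        return 2
    report = json.load(stdin)
    print(count_rule(report, argv[1]), file=stdout)
    return 0
```

To use it as a script, one would end the file with `raise SystemExit(main(sys.argv))`. Printing only the integer keeps the metric trivially parsable.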
 ### 5. Browser/QA harness improvement
 Use only deterministic checks.
 ```yaml
 goal: Increase deterministic /swarm smoke coverage.
 scope:
  - tests/browser/swarm-smoke.*
  - src/routes/**/swarm*
 mutable_target: smoke test file first; product code only with explicit approval
 locked_eval:
  - expected role list
  - API response assertions
 metric: passing smoke assertions count
 direction: higher
 verify: pnpm exec playwright test tests/browser/swarm-smoke.spec.ts --reporter=json
 guard: pnpm exec vitest run src/server/swarm-health.test.ts
 iterations: 3
 ```
 ## Bad targets / red flags
 Do not run autoresearch when:
 - the loop can edit the eval, dataset, scorer, or answer key
 - the metric is a proxy that can be gamed easily
 - the desired improvement is mostly taste or strategy
 - the work touches secrets, account settings, public posting, deploys, merges, or destructive cleanup
 - the scope is broad enough to rewrite the vault/repo
 - the verification command is slow, flaky, or manually judged
 - the agent cannot parse the metric deterministically
 Common reward-hacking examples:
 - deleting hard tests to improve pass rate
 - changing a rubric/answer key instead of behavior
 - caching fixture outputs instead of solving the task
 - suppressing errors instead of fixing causes
 - narrowing search to known examples only
 - adding brittle sleeps/retries to hide flake
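Several of these failure modes (deleted tests, an edited rubric or answer key) can be caught by fingerprinting the locked eval files before the loop and after each iteration. This is a minimal sketch, assuming the caller expands the `locked_eval` globs into concrete paths; `fingerprint` and `changed_files` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def fingerprint(paths) -> dict:
    """Map each path to the SHA-256 of its bytes, or 'missing' if it is not a file."""
    digests = {}
    for path in sorted(str(p) for p in paths):
        file = Path(path)
        if file.is_file():
            digests[path] = hashlib.sha256(file.read_bytes()).hexdigest()
        else:
            digests[path] = "missing"
    return digests


def changed_files(before: dict, after: dict) -> list:
    """Return paths whose digest differs, including files added or removed."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A non-empty result from `changed_files` would be treated like a guard failure: revert the iteration and log it. It does not detect cached fixture outputs or suppressed errors, which still need the reviewer pass.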
 ## Pilot before background
 Default pilot sequence:
 1. Run `researcher:quick` to draft the contract.
 2. Run `reviewer` on the contract for metric-hacking risk.
 3. Run `researcher:autoresearch` for 3 iterations, in the foreground or in a durable session only.
 4. Run `reviewer` on kept diffs.
 5. Run `qa` or focused verification.
 6. Let `km-agent` capture only durable lessons.
 Only after a clean pilot should an orchestrator approve a longer or background loop.
 ## Exit report
 Every run must finish with:
 ```text
 Goal:
 Scope:
 Metric baseline -> final:
 Iterations attempted:
 Kept changes:
 Reverted changes:
 Verification:
 Guard result:
 Reward-hacking review:
 Remaining risks:
 Next recommended loop or stop condition:
 ```
--- a/docs/swarm/README.md
+++ b/docs/swarm/README.md
@@ -16,6 +16,7 @@ This is not a chat wrapper with tabs. It is the operating surface for a local ag
 - [QUICKSTART.md](./QUICKSTART.md) — clone, run, detect profiles, spawn workers, dispatch the first task.
 - [ARCHITECTURE.md](./ARCHITECTURE.md) — loop, SwarmBrief shape, notification routing, lanes, review, repair.
 - [AUTORESEARCH.md](./AUTORESEARCH.md) — bounded optimization-loop contract for `researcher:autoresearch`.
 - [SKILLS.md](./SKILLS.md) — bundled swarm skills, auto-loading, and custom skill conventions.
 - [ROLES.md](./ROLES.md) — role presets used by the Add Swarm dialog and the canonical project specs.
@@ -96,8 +97,9 @@ Read these in order if you are testing the v1 release:
 1. [QUICKSTART.md](./QUICKSTART.md)
 2. [ARCHITECTURE.md](./ARCHITECTURE.md)
-3. [ROLES.md](./ROLES.md)
+3. [AUTORESEARCH.md](./AUTORESEARCH.md)
-4. [SKILLS.md](./SKILLS.md)
+4. [ROLES.md](./ROLES.md)
 5. [SKILLS.md](./SKILLS.md)
 ## Canonical spec
--- a/docs/swarm/ROLES.md
+++ b/docs/swarm/ROLES.md
@@ -218,7 +218,7 @@ Canonical spec:
 /swarm-specs/projects/swarm4.md
 ```
-Sage drafts; humans approve public posting.
+Sage drafts; humans approve public posting. Use normal research for evidence gathering and synthesis. Use autoresearch only for bounded optimization loops with an explicit Goal/Scope/Metric/Verify/Guard/Iterations contract; see [AUTORESEARCH.md](./AUTORESEARCH.md).
 ## Scribe