From 577c287aae5c27c7f0a77ec4ce94fd3be1eed57f Mon Sep 17 00:00:00 2001
From: Cossackx <121278003+Cossackx@users.noreply.github.com>
Date: Wed, 13 May 2026 23:03:08 -0400
Subject: [PATCH] docs: add autoresearch operating contract (#435)

Co-authored-by: RAZSOC Local
---
 agents/researcher/README.md |   7 +
 docs/swarm/AUTORESEARCH.md  | 254 ++++++++++++++++++++++++++++++++++++
 docs/swarm/README.md        |   6 +-
 docs/swarm/ROLES.md         |   2 +-
 4 files changed, 266 insertions(+), 3 deletions(-)
 create mode 100644 docs/swarm/AUTORESEARCH.md

diff --git a/agents/researcher/README.md b/agents/researcher/README.md
index b16933ca..c3f9b083 100644
--- a/agents/researcher/README.md
+++ b/agents/researcher/README.md
@@ -16,4 +16,11 @@ gbrain
 ## Plugins
 none
 
+## Mode split
+
+- `researcher:quick`: default. Brain-first lookup, external source collection, synthesis, citations, and recommendations.
+- `researcher:autoresearch`: gated optimization loop only. Do not start unless Goal, Scope, Mutable target, Locked eval, Metric, Direction, Verify, Guard, Iterations, Results log, Rollback, and Greenlight boundaries are explicit.
+
+The source-owned operating contract is `docs/swarm/AUTORESEARCH.md`.
+
 This file mirrors `swarm.yaml` and the profile config under `~/.hermes/profiles/researcher/`.
diff --git a/docs/swarm/AUTORESEARCH.md b/docs/swarm/AUTORESEARCH.md
new file mode 100644
index 00000000..96bf6df5
--- /dev/null
+++ b/docs/swarm/AUTORESEARCH.md
@@ -0,0 +1,254 @@
+# Autoresearch Mode
+
+Autoresearch is a bounded optimization harness for Hermes Agents. It is not the default research workflow.
+
+Use it only when the system can mechanically decide whether an iteration improved.
+
+```text
+normal research = gather evidence -> synthesize -> recommend
+autoresearch mode = mutate one target -> verify metric -> keep/revert -> repeat
+```
+
+## Source pattern
+
+The useful pattern from Karpathy-style autoresearch and downstream Claude/Codex ports is stable:
+
+1. Lock the scope.
+2. Lock the evaluation surface.
+3. Pick one scalar metric.
+4. Mutate one narrow target.
+5. Run a mechanical verifier.
+6. Keep improvements.
+7. Revert worse/crashing/guard-failing changes.
+8. Log every iteration.
+9. Stop at the configured budget.
+
+If you cannot evaluate it mechanically, do not autoresearch it.
+
+## When to use `researcher:quick`
+
+Use normal researcher mode for:
+
+- web/GitHub/X/Reddit/Medium/YouTube/source collection
+- market/model/library scans
+- literature review
+- qualitative synthesis
+- tradeoff notes
+- recommendations where judgment matters
+
+`researcher:quick` may produce an autoresearch config, but it should not start the loop unless the contract below is filled.
+
+## Autoresearch entry contract
+
+A loop may start only when these fields are explicit:
+
+```yaml
+goal:
+scope:
+mutable_target:
+locked_eval:
+metric:
+direction: higher|lower
+verify:
+guard:
+iterations:
+time_budget:
+results_log: autoresearch-results/results.tsv
+rollback: revert worse, crashing, unparsable, or guard-failing changes
+greenlight: required for destructive, public, credential, account, push, deploy, merge, or bulk edits
+```
+
+Do not infer missing fields silently. If a field is unknown, run `autoresearch:plan` / planning mode first.
+
+## Iteration discipline
+
+Each iteration should follow this shape:
+
+```text
+1. Read current state, prior results log, and recent git history.
+2. Pick one small, falsifiable change.
+3. Edit only allowed mutable targets.
+4. Commit or checkpoint the candidate.
+5. Run verify and guard commands.
+6. Parse metric.
+7. If improved and guards pass: keep.
+8. If worse, equal-with-more-complexity, crashed, or guards fail: revert.
+9. Append results_log.
+10. Continue until iteration/time budget is exhausted.
+```
+
+Use simplicity as a tie-breaker: equal metric with less code/complexity may be kept; equal metric with more complexity must be reverted.
+
+## Required log shape
+
+Use TSV or JSONL. TSV default:
+
+```tsv
+iteration	commit	metric	delta	status	summary	verify	guard
+0	baseline	42	0	baseline	initial metric	pass	pass
+1	abc123	39	-3	keep	reduced failing lint count in parser	pass	pass
+2	-	45	+6	revert	broadened change broke type guard	pass	fail
+```
+
+Keep failures visible. Reverting a failed experiment is part of the evidence trail, not a problem to hide.
+
+## Role ownership
+
+- `orchestrator`: approves entering autoresearch, locks scope/eval/metric/budget, and decides whether the loop may run in durable/background mode.
+- `researcher:quick`: gathers external/internal evidence and may draft the contract.
+- `researcher:autoresearch`: runs the loop after the contract is complete.
+- `reviewer`: checks kept changes for metric hacking, overfitting, security regressions, and hidden scope expansion.
+- `qa`: replays final verification and any browser/API smoke.
+- `km-agent`: promotes durable lessons/results into RAZSOC/GBrain after review.
+
+## Good targets for this stack
+
+### 1. Hermes skill optimization
+
+Improve one skill against fixed prompts and binary rubric checks.
+
+```yaml
+goal: Improve reviewer-core bug catching without increasing false positives.
+scope:
+  - /home/aleks/.hermes/skills/**/reviewer-core/SKILL.md
+mutable_target: reviewer-core/SKILL.md
+locked_eval:
+  - evals/reviewer-core/cases/*.md
+  - evals/reviewer-core/rubric.json
+metric: rubric score out of 100
+direction: higher
+verify: python evals/reviewer-core/run_eval.py --json
+guard: hermes chat -Q -t reviewer:gate -q 'load reviewer-core and summarize readiness' | grep -q reviewer
+iterations: 3
+```
+
+### 2. Profile prompt optimization
+
+Tune one profile against fixed briefs.
+
+```yaml
+goal: Make researcher choose GBrain-first lookup reliably before web search.
+scope:
+  - /home/aleks/.hermes/profiles/researcher/SOUL.md
+  - /home/aleks/.hermes/profiles/researcher/skills/researcher-quick/SKILL.md
+mutable_target: researcher profile guidance
+locked_eval:
+  - evals/researcher-routing/cases.jsonl
+metric: pass rate across routing cases
+direction: higher
+verify: python evals/researcher-routing/run_eval.py
+guard: hermes chat -Q -t researcher:quick -q 'respond with mode readiness only'
+iterations: 3
+```
+
+### 3. GBrain retrieval routing
+
+Optimize route rules/prompts against known-answer fixtures. The corpus and answer key are locked.
+
+```yaml
+goal: Improve citation-correct answers for RAZSOC/GBrain architecture questions.
+scope:
+  - skills/note-taking/gbrain/SKILL.md
+  - profiles/km-agent/SOUL.md
+mutable_target: retrieval/routing guidance only
+locked_eval:
+  - evals/gbrain-routing/questions.jsonl
+  - evals/gbrain-routing/answers.jsonl
+metric: exact-or-cited-correct score
+direction: higher
+verify: python evals/gbrain-routing/run_eval.py --max-cases 12
+guard: gbrain stats >/dev/null
+iterations: 3
+```
+
+### 4. Repo cleanup loop
+
+Reduce one failure class with focused guards.
+
+```yaml
+goal: Reduce no-explicit-any count in changed TypeScript files.
+scope:
+  - src/**/*.ts
+  - src/**/*.tsx
+mutable_target: one module or route family per iteration
+locked_eval:
+  - package.json
+  - eslint config
+metric: eslint no-explicit-any violation count
+direction: lower
+verify: pnpm exec eslint src --format json | python scripts/count-eslint-rule.py @typescript-eslint/no-explicit-any
+guard: pnpm exec vitest run
+iterations: 5
+```
+
+### 5. Browser/QA harness improvement
+
+Use only deterministic checks.
+
+```yaml
+goal: Increase deterministic /swarm smoke coverage.
+scope:
+  - tests/browser/swarm-smoke.*
+  - src/routes/**/swarm*
+mutable_target: smoke test file first; product code only with explicit approval
+locked_eval:
+  - expected role list
+  - API response assertions
+metric: passing smoke assertions count
+direction: higher
+verify: pnpm exec playwright test tests/browser/swarm-smoke.spec.ts --reporter=json
+guard: pnpm exec vitest run src/server/swarm-health.test.ts
+iterations: 3
+```
+
+## Bad targets / red flags
+
+Do not run autoresearch when:
+
+- the loop can edit the eval, dataset, scorer, or answer key
+- the metric is a proxy that can be gamed easily
+- the desired improvement is mostly taste or strategy
+- the work touches secrets, account settings, public posting, deploys, merges, or destructive cleanup
+- the scope is broad enough to rewrite the vault/repo
+- the verification command is slow, flaky, or manually judged
+- the agent cannot parse the metric deterministically
+
+Common reward-hacking examples:
+
+- deleting hard tests to improve pass rate
+- changing a rubric/answer key instead of behavior
+- caching fixture outputs instead of solving the task
+- suppressing errors instead of fixing causes
+- narrowing search to known examples only
+- adding brittle sleeps/retries to hide flake
+
+## Pilot before background
+
+Default wedge:
+
+1. Run `researcher:quick` to draft the contract.
+2. Run `reviewer` on the contract for metric-hacking risk.
+3. Run `researcher:autoresearch` for 3 iterations foreground/durable-session only.
+4. Run `reviewer` on kept diffs.
+5. Run `qa` or focused verification.
+6. Let `km-agent` capture only durable lessons.
+
+Only after a clean pilot should an orchestrator approve a longer or background loop.
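Reviewer sketch, not part of the patch: the keep/revert rule from "Iteration discipline" and the row layout from "Required log shape" in `AUTORESEARCH.md` could be expressed roughly as follows. `IterationResult`, `decide`, `tsv_row`, and the `complexity_delta` field are hypothetical names chosen for illustration, not Hermes APIs. Reverting an exact tie whose complexity is unchanged is an assumption; the contract only covers ties that are simpler or more complex.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IterationResult:
    """Outcome of one candidate change (hypothetical shape, not a Hermes API)."""
    iteration: int
    commit: str
    metric: Optional[float]  # None when the verifier output could not be parsed
    verify_ok: bool
    guard_ok: bool
    summary: str
    complexity_delta: int = 0  # assumed signal: negative means the change is simpler


def decide(best: float, result: IterationResult, direction: str) -> str:
    """Return "keep" or "revert" for a candidate measured against the best metric so far."""
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    # Crashed, unparsable, or guard-failing candidates are always reverted.
    if result.metric is None or not (result.verify_ok and result.guard_ok):
        return "revert"
    if result.metric == best:
        # Simplicity tie-breaker: an equal metric is kept only when the change is simpler.
        return "keep" if result.complexity_delta < 0 else "revert"
    improved = result.metric > best if direction == "higher" else result.metric < best
    return "keep" if improved else "revert"


def tsv_row(best: float, result: IterationResult, status: str) -> str:
    """Format one results_log row in the column order of the TSV example."""
    if result.metric is None:
        metric, delta = "", ""
    else:
        metric = format(result.metric, "g")
        delta = format(result.metric - best, "+g")
    fields = [
        str(result.iteration),
        result.commit if status == "keep" else "-",  # reverted rows carry no commit
        metric,
        delta,
        status,
        " ".join(result.summary.split()),  # collapse tabs/newlines so the row stays one line
        "pass" if result.verify_ok else "fail",
        "pass" if result.guard_ok else "fail",
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Replay iteration 1 of the log example: baseline 42, direction "lower".
    candidate = IterationResult(1, "abc123", 39.0, True, True,
                                "reduced failing lint count in parser")
    print(tsv_row(42.0, candidate, decide(42.0, candidate, "lower")))
```

Keeping the decision a pure function of the parsed metric and the guard results means a `reviewer` can replay it against the results log without re-running the loop.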
+
+## Exit report
+
+Every run must finish with:
+
+```text
+Goal:
+Scope:
+Metric baseline -> final:
+Iterations attempted:
+Kept changes:
+Reverted changes:
+Verification:
+Guard result:
+Reward-hacking review:
+Remaining risks:
+Next recommended loop or stop condition:
+```
diff --git a/docs/swarm/README.md b/docs/swarm/README.md
index fa14e783..e801300d 100644
--- a/docs/swarm/README.md
+++ b/docs/swarm/README.md
@@ -16,6 +16,7 @@ This is not a chat wrapper with tabs. It is the operating surface for a local ag
 
 - [QUICKSTART.md](./QUICKSTART.md) — clone, run, detect profiles, spawn workers, dispatch the first task.
 - [ARCHITECTURE.md](./ARCHITECTURE.md) — loop, SwarmBrief shape, notification routing, lanes, review, repair.
+- [AUTORESEARCH.md](./AUTORESEARCH.md) — bounded optimization-loop contract for `researcher:autoresearch`.
 - [SKILLS.md](./SKILLS.md) — bundled swarm skills, auto-loading, and custom skill conventions.
 - [ROLES.md](./ROLES.md) — role presets used by the Add Swarm dialog and the canonical project specs.
 
@@ -96,8 +97,9 @@ Read these in order if you are testing the v1 release:
 
 1. [QUICKSTART.md](./QUICKSTART.md)
 2. [ARCHITECTURE.md](./ARCHITECTURE.md)
-3. [ROLES.md](./ROLES.md)
-4. [SKILLS.md](./SKILLS.md)
+3. [AUTORESEARCH.md](./AUTORESEARCH.md)
+4. [ROLES.md](./ROLES.md)
+5. [SKILLS.md](./SKILLS.md)
 
 ## Canonical spec
 
diff --git a/docs/swarm/ROLES.md b/docs/swarm/ROLES.md
index 6b5473a3..d276dd05 100644
--- a/docs/swarm/ROLES.md
+++ b/docs/swarm/ROLES.md
@@ -218,7 +218,7 @@ Canonical spec:
 /swarm-specs/projects/swarm4.md
 ```
 
-Sage drafts; humans approve public posting.
+Sage drafts; humans approve public posting. Use normal research for evidence gathering and synthesis. Use autoresearch only for bounded optimization loops with an explicit Goal/Scope/Metric/Verify/Guard/Iterations contract; see [AUTORESEARCH.md](./AUTORESEARCH.md).
 
 ## Scribe