Autoresearch Mode
Autoresearch is a bounded optimization harness for Hermes Agents. It is not the default research workflow.
Use it only when the system can mechanically decide whether an iteration improved the metric.
normal research = gather evidence -> synthesize -> recommend
autoresearch mode = mutate one target -> verify metric -> keep/revert -> repeat
Source pattern
The useful pattern from Karpathy-style autoresearch and downstream Claude/Codex ports is stable:
- Lock the scope.
- Lock the evaluation surface.
- Pick one scalar metric.
- Mutate one narrow target.
- Run a mechanical verifier.
- Keep improvements.
- Revert worse/crashing/guard-failing changes.
- Log every iteration.
- Stop at the configured budget.
If you cannot evaluate it mechanically, do not autoresearch it.
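The pattern above can be sketched as a bounded loop. This is a minimal illustration, not the harness itself: `run_bounded_loop`, `try_candidate`, and `revert` are hypothetical names, and the caller is assumed to supply the mutation, verification, and rollback steps as callables.

```python
import time
from typing import Callable, List, Optional, Tuple

History = List[Tuple[int, Optional[float], str]]


def run_bounded_loop(
    baseline: float,
    direction: str,  # "higher" or "lower"
    iterations: int,
    try_candidate: Callable[[int], Tuple[Optional[float], bool]],
    revert: Callable[[], None],
    time_budget_s: Optional[float] = None,
) -> Tuple[float, History]:
    """Run at most `iterations` candidates and stop early at the time budget.

    try_candidate(i) is assumed to apply one mutation, run verify and guard,
    and return (metric or None if unparsable/crashed, guards_passed).
    """
    best = baseline
    history: History = []
    started = time.monotonic()
    for i in range(1, iterations + 1):
        # Stop at the configured wall-clock budget, if one was given.
        if time_budget_s is not None and time.monotonic() - started >= time_budget_s:
            break
        metric, guards_ok = try_candidate(i)
        improved = (
            metric is not None
            and guards_ok
            and (metric > best if direction == "higher" else metric < best)
        )
        if improved:
            best = metric
            history.append((i, metric, "keep"))
        else:
            # Worse, equal, crashing, or guard-failing candidates are rolled back.
            revert()
            history.append((i, metric, "revert"))
    return best, history
```

Every outcome is recorded in `history`, including reverts, so the caller can write one log row per iteration.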
When to use researcher:quick
Use normal researcher mode for:
- web/GitHub/X/Reddit/Medium/YouTube/source collection
- market/model/library scans
- literature review
- qualitative synthesis
- tradeoff notes
- recommendations where judgment matters
researcher:quick may produce an autoresearch config, but it should not start the loop unless the contract below is complete.
Autoresearch entry contract
A loop may start only when these fields are explicit:
goal: <one sentence outcome>
scope: <files/directories/knobs the loop may edit>
mutable_target: <specific file, skill, prompt, or narrow directory>
locked_eval: <files/datasets/scoring scripts the loop may not edit>
metric: <scalar number and unit>
direction: higher|lower
verify: <command that emits or lets us parse the metric>
guard: <command(s) that must keep passing>
iterations: <bounded count; default pilot is 3-5>
time_budget: <optional wall-clock cap>
results_log: autoresearch-results/results.tsv
rollback: revert worse, crashing, unparsable, or guard-failing changes
greenlight: required for destructive, public, credential, account, push, deploy, merge, or bulk edits
Do not infer missing fields silently. If a field is unknown, run autoresearch:plan / planning mode first.
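A pre-flight check for the contract might look like the sketch below. It assumes the contract has already been parsed into a dict. The function name and the choice of which fields count as required are illustrative assumptions: here `time_budget` is optional, and `results_log`, `rollback`, and `greenlight` are treated as fixed defaults rather than user input.

```python
# Fields the loop cannot start without (assumed set, see note above).
REQUIRED_FIELDS = (
    "goal",
    "scope",
    "mutable_target",
    "locked_eval",
    "metric",
    "direction",
    "verify",
    "guard",
    "iterations",
)


def missing_contract_fields(contract: dict) -> list:
    """Return required fields that are absent, empty, or invalid, in contract order."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = contract.get(field)
        if value is None or value == "" or value == []:
            problems.append(field)
        elif field == "direction" and value not in ("higher", "lower"):
            problems.append(field)
        elif field == "iterations" and (
            isinstance(value, bool) or not isinstance(value, int) or value < 1
        ):
            # The iteration count must be a bounded positive integer.
            problems.append(field)
    return problems
```

A non-empty result means the loop should not start and planning mode should fill the gaps first.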
Iteration discipline
Each iteration should follow this shape:
1. Read current state, prior results log, and recent git history.
2. Pick one small, falsifiable change.
3. Edit only allowed mutable targets.
4. Commit or checkpoint the candidate.
5. Run verify and guard commands.
6. Parse metric.
7. If improved and guards pass: keep.
8. If worse, equal-with-more-complexity, crashed, or guards fail: revert.
9. Append results_log.
10. Continue until iteration/time budget is exhausted.
Use simplicity as a tie-breaker: equal metric with less code/complexity may be kept; equal metric with more complexity must be reverted.
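The keep/revert rule, including the simplicity tie-breaker, can be written as one small function. This is a sketch under assumptions: `decide` and `complexity_delta` are hypothetical names, and how complexity is measured (for example, net lines changed) is left to the caller. Equal metric with unchanged complexity is not covered by the rule above, so this sketch reverts it as the conservative choice.

```python
from typing import Optional


def decide(
    baseline: float,
    candidate: Optional[float],
    direction: str,  # "higher" or "lower"
    guards_passed: bool,
    complexity_delta: int = 0,  # hypothetical measure; negative means simpler
) -> str:
    """Return "keep" or "revert" for one iteration."""
    # A crashed or unparsable run has no metric; guard failures always revert.
    if candidate is None or not guards_passed:
        return "revert"
    if direction == "higher":
        improved = candidate > baseline
    else:
        improved = candidate < baseline
    if improved:
        return "keep"
    # Tie-breaker: an equal metric is kept only when the change is simpler.
    if candidate == baseline and complexity_delta < 0:
        return "keep"
    return "revert"
```

For example, with `direction="lower"` and a baseline of 42, a candidate of 39 is kept, a candidate of 45 is reverted, and a candidate of 42 is kept only when `complexity_delta` is negative.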
Required log shape
Use TSV or JSONL. TSV default:
iteration	commit	metric	delta	status	summary	verify	guard
0	baseline	42	0	baseline	initial metric	pass	pass
1	abc123	39	-3	keep	reduced failing lint count in parser	pass	pass
2	-	45	+6	revert	broadened change broke type guard	pass	fail
Keep failures visible. Reverting a failed experiment is part of the evidence trail, not a problem to hide.
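Appending to the TSV log can be done with the standard `csv` module. The helper below is a sketch: `append_result` is a hypothetical name, and the column order follows the TSV header shown above.

```python
import csv
from pathlib import Path

COLUMNS = ["iteration", "commit", "metric", "delta", "status", "summary", "verify", "guard"]


def append_result(log_path: str, row: dict) -> None:
    """Append one iteration row, writing the header first if the log is new or empty."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if write_header:
            writer.writeheader()
        # Reverted and failed iterations are written like any other row.
        writer.writerow(row)
```

Opening the file in append mode means earlier rows are never rewritten, so reverted experiments stay in the evidence trail.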
Role ownership
- orchestrator: approves entering autoresearch, locks scope/eval/metric/budget, and decides whether the loop may run in durable/background mode.
- researcher:quick: gathers external/internal evidence and may draft the contract.
- researcher:autoresearch: runs the loop after the contract is complete.
- reviewer: checks kept changes for metric hacking, overfitting, security regressions, and hidden scope expansion.
- qa: replays final verification and any browser/API smoke.
- km-agent: promotes durable lessons/results into RAZSOC/GBrain after review.
Good targets for this stack
1. Hermes skill optimization
Improve one skill against fixed prompts and binary rubric checks.
goal: Improve reviewer-core bug catching without increasing false positives.
scope:
- /home/aleks/.hermes/skills/**/reviewer-core/SKILL.md
mutable_target: reviewer-core/SKILL.md
locked_eval:
- evals/reviewer-core/cases/*.md
- evals/reviewer-core/rubric.json
metric: rubric score out of 100
direction: higher
verify: python evals/reviewer-core/run_eval.py --json
guard: hermes chat -Q -t reviewer:gate -q 'load reviewer-core and summarize readiness' | grep -q reviewer
iterations: 3
2. Profile prompt optimization
Tune one profile against fixed briefs.
goal: Make researcher choose GBrain-first lookup reliably before web search.
scope:
- /home/aleks/.hermes/profiles/researcher/SOUL.md
- /home/aleks/.hermes/profiles/researcher/skills/researcher-quick/SKILL.md
mutable_target: researcher profile guidance
locked_eval:
- evals/researcher-routing/cases.jsonl
metric: pass rate across routing cases
direction: higher
verify: python evals/researcher-routing/run_eval.py
guard: hermes chat -Q -t researcher:quick -q 'respond with mode readiness only'
iterations: 3
3. GBrain retrieval routing
Optimize route rules/prompts against known-answer fixtures. The corpus and answer key are locked.
goal: Improve citation-correct answers for RAZSOC/GBrain architecture questions.
scope:
- skills/note-taking/gbrain/SKILL.md
- profiles/km-agent/SOUL.md
mutable_target: retrieval/routing guidance only
locked_eval:
- evals/gbrain-routing/questions.jsonl
- evals/gbrain-routing/answers.jsonl
metric: exact-or-cited-correct score
direction: higher
verify: python evals/gbrain-routing/run_eval.py --max-cases 12
guard: gbrain stats >/dev/null
iterations: 3
4. Repo cleanup loop
Reduce one failure class with focused guards.
goal: Reduce no-explicit-any count in changed TypeScript files.
scope:
- src/**/*.ts
- src/**/*.tsx
mutable_target: one module or route family per iteration
locked_eval:
- package.json
- eslint config
metric: eslint no-explicit-any violation count
direction: lower
verify: pnpm exec eslint src --format json | python scripts/count-eslint-rule.py @typescript-eslint/no-explicit-any
guard: pnpm exec vitest run <focused-tests>
iterations: 5
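The verify command above pipes ESLint's JSON report into scripts/count-eslint-rule.py, whose contents are not shown here. One way the counting logic could look is sketched below. It relies on the JSON formatter emitting a list of per-file results, each with a `messages` list whose entries carry a `ruleId`. The function name is hypothetical.

```python
import json


def count_rule_in_report(report_text: str, rule_id: str) -> int:
    """Count messages for one rule in an ESLint `--format json` report."""
    results = json.loads(report_text)
    return sum(
        1
        for file_result in results
        for message in file_result.get("messages", [])
        # ruleId can be null for parse errors, so compare explicitly.
        if message.get("ruleId") == rule_id
    )
```

A thin wrapper that reads the report from standard input and prints the integer alone would give the loop a metric it can parse deterministically.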
5. Browser/QA harness improvement
Use only deterministic checks.
goal: Increase deterministic /swarm smoke coverage.
scope:
- tests/browser/swarm-smoke.*
- src/routes/**/swarm*
mutable_target: smoke test file first; product code only with explicit approval
locked_eval:
- expected role list
- API response assertions
metric: passing smoke assertions count
direction: higher
verify: pnpm exec playwright test tests/browser/swarm-smoke.spec.ts --reporter=json
guard: pnpm exec vitest run src/server/swarm-health.test.ts
iterations: 3
Bad targets / red flags
Do not run autoresearch when:
- the loop can edit the eval, dataset, scorer, or answer key
- the metric is a proxy that can be gamed easily
- the desired improvement is mostly taste or strategy
- the work touches secrets, account settings, public posting, deploys, merges, or destructive cleanup
- the scope is broad enough to rewrite the vault/repo
- the verification command is slow, flaky, or manually judged
- the agent cannot parse the metric deterministically
Common reward-hacking examples:
- deleting hard tests to improve pass rate
- changing a rubric/answer key instead of behavior
- caching fixture outputs instead of solving the task
- suppressing errors instead of fixing causes
- narrowing search to known examples only
- adding brittle sleeps/retries to hide flake
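Some of these failure modes can be caught mechanically. As one example, edits to the rubric, answer key, or test cases show up if the locked eval files are fingerprinted before the loop and checked after each iteration. The sketch below uses SHA-256 digests. `snapshot` and `tampered` are hypothetical helper names, and the caller is assumed to expand any glob patterns in locked_eval into concrete paths first.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def fingerprint(path: str) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    file_path = Path(path)
    if not file_path.is_file():
        return None
    return hashlib.sha256(file_path.read_bytes()).hexdigest()


def snapshot(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Record a digest for every locked eval file before the loop starts."""
    return {str(path): fingerprint(str(path)) for path in paths}


def tampered(before: Dict[str, Optional[str]]) -> List[str]:
    """List locked files that were modified or deleted since the snapshot."""
    return sorted(path for path, digest in before.items() if fingerprint(path) != digest)
```

A non-empty `tampered(...)` result would count as a guard failure and trigger a revert. This check does not notice files newly added next to the locked ones, so it complements the reviewer pass and does not replace it.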
Pilot before background
Default wedge:
- Run researcher:quick to draft the contract.
- Run reviewer on the contract for metric-hacking risk.
- Run researcher:autoresearch for 3 iterations foreground/durable-session only.
- Run reviewer on kept diffs.
- Run qa or focused verification.
- Let km-agent capture only durable lessons.
Only after a clean pilot should an orchestrator approve a longer or background loop.
Exit report
Every run must finish with:
Goal:
Scope:
Metric baseline -> final:
Iterations attempted:
Kept changes:
Reverted changes:
Verification:
Guard result:
Reward-hacking review:
Remaining risks:
Next recommended loop or stop condition: