3.3 KiB
3.3 KiB
Swarm2 Worker Lifecycle + Compaction Spec
Date: 2026-04-28 Status: staged implementation
Problem
Swarm workers are persistent Claude agents with their own profiles, sessions, tmux panes, runtime state, and memory. That is the right architecture, but long-running workers will eventually degrade when context approaches the model window.
We need an automatic lifecycle system so workers can:
- run with large context budgets,
- checkpoint before quality drops,
- write durable handoffs,
- restart/new themselves cleanly,
- resume from handoff and mission state,
- keep the orchestrator informed.
Target behavior
Each worker gets a context policy:
- soft limit: 250k tokens, request concise checkpoint soon
- handoff limit: 400k tokens, write full handoff before more work
- hard limit: 500k tokens, stop accepting new work until renewed
Exact numbers should be configurable per model/profile, but default policy should be safe.
Lifecycle states
healthy: under soft limitwatch: over soft limit, continue but monitorhandoff_required: over handoff limit, ask worker to write handoffrenew_required: over hard limit or repeatedly stale/fragmentedrenewing: handoff was requested and tmux/session is being restartedblocked: renewal failed or handoff missing
Handoff contract
Before renewal, worker must write or return a handoff containing:
STATE: HANDOFF
FILES_CHANGED: ...
COMMANDS_RUN: ...
RESULT: current state and what landed
BLOCKER: blocker or none
NEXT_ACTION: exact next step after renewal
Durable handoff path:
/Users/aurora/.openclaw/workspace/memory/handoffs/swarm/<workerId>-latest.md
Optional timestamped archive can exist later, but latest.md is the resume source.
Renewal sequence
- Detect context pressure from Claude
state.dbsession token counts. - Ask worker for handoff via tmux dispatch.
- Parse handoff checkpoint from chat.
- Save handoff into durable memory path.
- Stop worker tmux session.
- Start clean Claude session with same profile and cwd.
- Send resume prompt containing handoff summary + active mission assignment.
- Mark runtime state healthy/executing.
Product requirements
Swarm2 UI should show:
- context state per worker
- current session token estimate
- lifecycle status
- last handoff time
- renew button
- automatic renewal status
Safety rules
- Never auto-renew while worker is actively writing unless hard limit is reached.
- Never start destructive execution after renewal without mission policy allowing it.
- Handoff must be complete before restart unless human forces renewal.
- If handoff parse fails, mark worker
blockedand ask orchestrator/human.
Stage 1 implementation
- Add lifecycle status API.
- Read latest session token counts from
state.db. - Return lifecycle state and recommended action.
- Add request-handoff action that sends a strict handoff prompt to tmux.
- Add renew action, but require
force: truefor now. - Normalize swarm wrappers to use
/Users/aurora/hermes-workspacecwd.
Stage 2
- Parse handoff checkpoints into durable handoff files.
- Add runtime.json fields for
contextTokens,contextState,lastHandoffAt. - Add Swarm2 UI lifecycle badges.
Stage 3
- Automatic renewal loop.
- Resume prompt with mission state.
- Per-model context policies.