Swarm2 Worker Lifecycle + Compaction Spec

Date: 2026-04-28
Status: staged implementation

Problem

Swarm workers are persistent Claude agents with their own profiles, sessions, tmux panes, runtime state, and memory. That is the right architecture, but a long-running worker will eventually degrade as its session context approaches the model's context window.

We need an automatic lifecycle system so workers can:

run with large context budgets,
checkpoint before quality drops,
write durable handoffs,
restart themselves cleanly in a new session,
resume from handoff and mission state,
keep the orchestrator informed.

Target behavior

Each worker gets a context policy:

soft limit: 250k tokens, request concise checkpoint soon
handoff limit: 400k tokens, write full handoff before more work
hard limit: 500k tokens, stop accepting new work until renewed

Exact numbers should be configurable per model/profile, but default policy should be safe.
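
A minimal sketch of how such a policy could be represented, assuming a Python implementation. The names `ContextPolicy`, `PROFILE_POLICIES`, and `policy_for`, and the example override profile with its numbers, are hypothetical; only the three default limits come from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPolicy:
    """Token limits for one worker; defaults mirror the limits listed above."""
    soft_limit: int = 250_000
    handoff_limit: int = 400_000
    hard_limit: int = 500_000

    def __post_init__(self) -> None:
        # Reject policies whose limits are not strictly increasing.
        if not 0 < self.soft_limit < self.handoff_limit < self.hard_limit:
            raise ValueError("limits must satisfy 0 < soft < handoff < hard")


DEFAULT_POLICY = ContextPolicy()

# Hypothetical per-profile overrides; the profile name and numbers are
# illustrative only and do not come from the spec.
PROFILE_POLICIES = {
    "small-window-example": ContextPolicy(
        soft_limit=100_000, handoff_limit=150_000, hard_limit=180_000
    ),
}


def policy_for(profile: str) -> ContextPolicy:
    """Return the override for a profile, falling back to the safe default."""
    return PROFILE_POLICIES.get(profile, DEFAULT_POLICY)
```

Validating the ordering at construction time means a misconfigured profile fails when it is loaded, not in the middle of a renewal.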

Lifecycle states

healthy: under soft limit
watch: over soft limit, continue but monitor
handoff_required: over handoff limit, ask worker to write handoff
renew_required: over hard limit or repeatedly stale/fragmented
renewing: handoff was requested and tmux/session is being restarted
blocked: renewal failed or handoff missing
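
The token-driven states above could be derived roughly as follows. This is a sketch under two assumptions: reaching a limit exactly counts as being over it, and `renewing` and `blocked` are set by the renewal process, not computed from token counts. The function name `classify` and the `repeatedly_stale` flag are hypothetical.

```python
from enum import Enum


class LifecycleState(str, Enum):
    HEALTHY = "healthy"
    WATCH = "watch"
    HANDOFF_REQUIRED = "handoff_required"
    RENEW_REQUIRED = "renew_required"
    RENEWING = "renewing"   # set by the renewal process, not by classify()
    BLOCKED = "blocked"     # set when renewal fails or the handoff is missing


def classify(
    tokens: int,
    soft: int = 250_000,
    handoff: int = 400_000,
    hard: int = 500_000,
    repeatedly_stale: bool = False,
) -> LifecycleState:
    """Map a session token count to a lifecycle state, most severe first."""
    if tokens >= hard or repeatedly_stale:
        return LifecycleState.RENEW_REQUIRED
    if tokens >= handoff:
        return LifecycleState.HANDOFF_REQUIRED
    if tokens >= soft:
        return LifecycleState.WATCH
    return LifecycleState.HEALTHY
```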

Handoff contract

Before renewal, worker must write or return a handoff containing:

STATE: HANDOFF
FILES_CHANGED: ...
COMMANDS_RUN: ...
RESULT: current state and what landed
BLOCKER: blocker or none
NEXT_ACTION: exact next step after renewal
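
One way the handoff fields above might be parsed from a worker reply, as a sketch. It assumes each field starts a line as `KEY: value`, that indented or unlabeled lines continue the previous field, and that surrounding chat text can be ignored. `parse_handoff` and `HandoffParseError` are hypothetical names.

```python
REQUIRED_FIELDS = (
    "STATE", "FILES_CHANGED", "COMMANDS_RUN", "RESULT", "BLOCKER", "NEXT_ACTION",
)


class HandoffParseError(ValueError):
    """Raised when a reply does not satisfy the handoff contract."""


def parse_handoff(text: str) -> dict[str, str]:
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in REQUIRED_FIELDS:
            # A recognised key starts a new field.
            current = key.strip()
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Any other non-empty line continues the current field.
            fields[current] = (fields[current] + "\n" + line.strip()).strip()
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise HandoffParseError("missing fields: " + ", ".join(missing))
    if fields["STATE"] != "HANDOFF":
        raise HandoffParseError("STATE must be HANDOFF")
    return fields
```

Raising on any missing or empty field keeps an incomplete handoff from being treated as a valid resume source.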

Durable handoff path:

/Users/aurora/.openclaw/workspace/memory/handoffs/swarm/<workerId>-latest.md

An optional timestamped archive can be added later, but <workerId>-latest.md is the resume source.
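
A sketch of writing the latest handoff to that path. The atomic write through a temporary file is a design assumption, not something the spec requires, and the `base_dir` parameter and function names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_HANDOFF_DIR = Path("/Users/aurora/.openclaw/workspace/memory/handoffs/swarm")


def handoff_path(worker_id: str, base_dir: Path = DEFAULT_HANDOFF_DIR) -> Path:
    """Build <workerId>-latest.md, rejecting ids that could escape base_dir."""
    if not worker_id or "/" in worker_id or worker_id in {".", ".."}:
        raise ValueError(f"invalid worker id: {worker_id!r}")
    return base_dir / f"{worker_id}-latest.md"


def save_handoff(worker_id: str, text: str, base_dir: Path = DEFAULT_HANDOFF_DIR) -> Path:
    """Write the handoff to a temp file, then atomically replace latest.md."""
    target = handoff_path(worker_id, base_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name, suffix=".tmp"
    )
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(text)
    os.replace(tmp_name, target)  # readers never observe a half-written file
    return target
```

Because the file is replaced in one step, a worker resuming mid-write would see either the previous handoff or the new one, never a truncated mix.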

Renewal sequence

Detect context pressure from Claude state.db session token counts.
Ask worker for handoff via tmux dispatch.
Parse handoff checkpoint from chat.
Save handoff into durable memory path.
Stop worker tmux session.
Start clean Claude session with same profile and cwd.
Send resume prompt containing handoff summary + active mission assignment.
Mark runtime state healthy/executing.
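
The sequence above could be orchestrated along these lines. This is a sketch that assumes context pressure has already been detected and that each step is supplied as a callable. `RenewalSteps` and `renew_worker` are hypothetical names, and the real tmux and state.db integration is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RenewalSteps:
    """Injected side effects, so the ordering can be tested without tmux."""
    request_handoff: Callable[[str], str]            # returns the worker's reply
    parse_handoff: Callable[[str], Optional[dict]]   # None when parsing fails
    save_handoff: Callable[[str, dict], None]
    stop_session: Callable[[str], None]
    start_session: Callable[[str], None]
    send_resume: Callable[[str, dict], None]
    set_state: Callable[[str, str], None]


def renew_worker(worker_id: str, steps: RenewalSteps) -> str:
    """Run the renewal steps in order and return the final lifecycle state."""
    reply = steps.request_handoff(worker_id)
    steps.set_state(worker_id, "renewing")
    handoff = steps.parse_handoff(reply)
    if handoff is None:
        # No usable handoff: leave the old session running and escalate.
        steps.set_state(worker_id, "blocked")
        return "blocked"
    steps.save_handoff(worker_id, handoff)
    steps.stop_session(worker_id)
    steps.start_session(worker_id)
    steps.send_resume(worker_id, handoff)
    steps.set_state(worker_id, "healthy")
    return "healthy"
```

The old session is stopped only after the handoff has been saved, so a failed parse never destroys the worker's only copy of its state.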

Product requirements

Swarm2 UI should show:

context state per worker
current session token estimate
lifecycle status
last handoff time
renew button
automatic renewal status

Safety rules

Never auto-renew while worker is actively writing unless hard limit is reached.
Never start destructive execution after renewal without mission policy allowing it.
Handoff must be complete before restart unless human forces renewal.
If handoff parse fails, mark worker blocked and ask orchestrator/human.
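
The first and third rules can be read as a single guard, sketched below. It assumes that a human-forced renewal bypasses both checks; the function name `may_renew` and its parameters are hypothetical. The mission-policy and parse-failure rules are outside this check.

```python
def may_renew(
    *,
    over_hard_limit: bool,
    actively_writing: bool,
    handoff_complete: bool,
    human_forced: bool = False,
) -> bool:
    """Decide whether a worker may be restarted right now."""
    if human_forced:
        # Assumption: an explicit human override bypasses the other checks.
        return True
    if not handoff_complete:
        return False
    if actively_writing and not over_hard_limit:
        # Do not interrupt an active write below the hard limit.
        return False
    return True
```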

Stage 1 implementation

Add lifecycle status API.
Read latest session token counts from state.db.
Return lifecycle state and recommended action.
Add request-handoff action that sends a strict handoff prompt to tmux.
Add renew action, but require force: true for now.
Normalize swarm wrappers to use /Users/aurora/hermes-workspace cwd.
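
As an illustration of the request-handoff action, the sketch below builds a single-line strict prompt and the tmux commands that would deliver it. The prompt wording and function names are assumptions. The commands use `tmux send-keys` with `-l` to send the text literally, followed by a separate `Enter` key press.

```python
HANDOFF_FIELDS = (
    "STATE", "FILES_CHANGED", "COMMANDS_RUN", "RESULT", "BLOCKER", "NEXT_ACTION",
)


def build_handoff_prompt() -> str:
    """Single-line prompt; the wording is illustrative, not from the spec."""
    keys = ", ".join(HANDOFF_FIELDS)
    return (
        "Stop new work. Reply with a handoff checkpoint only, one field per line, "
        f"using exactly these keys in order: {keys}. Start with 'STATE: HANDOFF'."
    )


def tmux_send_commands(target: str, text: str) -> list[list[str]]:
    """Return argv lists that type text into a tmux pane and press Enter."""
    if "\n" in text:
        # A newline would submit the prompt early, so require one line.
        raise ValueError("prompt must be a single line")
    return [
        ["tmux", "send-keys", "-t", target, "-l", text],
        ["tmux", "send-keys", "-t", target, "Enter"],
    ]
```

Each returned list could be passed to `subprocess.run(argv, check=True)`. Returning the commands instead of running them keeps the builder testable without a live tmux server.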

Stage 2

Parse handoff checkpoints into durable handoff files.
Add runtime.json fields for contextTokens, contextState, lastHandoffAt.
Add Swarm2 UI lifecycle badges.
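
A sketch of how the new runtime.json fields might be merged into an existing file. Apart from the three field names, everything here is an assumption, including the ISO 8601 UTC timestamp format and the function name `update_runtime`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def update_runtime(
    path: Path,
    context_tokens: int,
    context_state: str,
    last_handoff_at: Optional[datetime] = None,
) -> dict:
    """Merge lifecycle fields into runtime.json, preserving other keys."""
    runtime = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    runtime["contextTokens"] = context_tokens
    runtime["contextState"] = context_state
    if last_handoff_at is not None:
        # Stored as an ISO 8601 string in UTC (format is an assumption).
        runtime["lastHandoffAt"] = last_handoff_at.astimezone(timezone.utc).isoformat()
    path.write_text(json.dumps(runtime, indent=2) + "\n", encoding="utf-8")
    return runtime
```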

Stage 3

Automatic renewal loop.
Resume prompt with mission state.
Per-model context policies.
