Capacity + integrity

LumenFlow treats capacity and integrity as first-class orchestration state — not as something every agent has to re-derive from git status. This page covers the knobs you set and the states the reconciler emits.

Capacity: `--max-active-workers`

By default, LumenFlow launches every eligible WU it finds, subject to a WIP limit of 1 per lane in each wave. To limit concurrency further (for context budget, API limits, or reviewer bandwidth), cap the orchestrator:

# Cap concurrent active workers at 2 for one invocation.
pnpm orchestrate:initiative -i INIT-056 -c --max-active-workers 2

You can make the cap permanent in workspace.yaml:

software_delivery:
  orchestration:
    max_active_workers: 2

The CLI flag wins over the config when both are present. A value of 0 means “pause launches entirely” — existing work keeps running; nothing new gets handed off.
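The precedence rule can be sketched as follows. This is a minimal illustration, not LumenFlow's code: resolveMaxActiveWorkers, remainingCapacity, and the OrchestrationConfig shape are hypothetical names, and rejecting negative or fractional values is an assumption the page does not state.

```typescript
// Hypothetical shape of the orchestration block in workspace.yaml.
interface OrchestrationConfig {
  max_active_workers?: number;
}

// Resolve the effective cap. Returns undefined when neither source sets one.
function resolveMaxActiveWorkers(
  cliFlag: number | undefined,
  config: OrchestrationConfig,
): number | undefined {
  // The CLI flag wins over the config. `??` (not `||`) keeps an explicit 0,
  // which means "pause launches entirely".
  const cap = cliFlag ?? config.max_active_workers;
  if (cap === undefined) return undefined;
  // Assumption: only non-negative integers are meaningful caps.
  if (!Number.isInteger(cap) || cap < 0) {
    throw new RangeError(`max_active_workers must be a non-negative integer, got ${cap}`);
  }
  return cap;
}

// Number of new workers that may be launched right now.
function remainingCapacity(cap: number | undefined, activeWorkers: number): number {
  if (cap === undefined) return Number.POSITIVE_INFINITY; // no cap configured
  // Existing work keeps running even when it exceeds the cap; nothing new launches.
  return Math.max(0, cap - activeWorkers);
}
```

With a cap of 0 and three workers already running, remainingCapacity returns 0: the running workers are untouched and no new WU is handed off.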

Behaviour when the cap binds

WUs that would otherwise be eligible receive launchBlockedBy: ['capacity'].
The next_safe_actions list in status.json surfaces a wait action with the message "Queued until worker capacity frees (remaining capacity: N)."
When remaining capacity is 0 and there are queued WUs, the reconciler emits an aggregate wait action naming everything queued.
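One way to picture the three behaviours above is the sketch below. The WaitAction and CapacityPlan shapes and the planWithCapacity function are hypothetical, the real status.json schema may differ, and reusing the quoted message for the aggregate action is an assumption.

```typescript
// Illustrative shapes only; not the real status.json schema.
interface WaitAction {
  type: 'wait';
  wuIds: string[];
  message: string;
}

interface CapacityPlan {
  launch: string[];
  launchBlockedBy: Record<string, string[]>;
  nextSafeActions: WaitAction[];
}

// `remaining` is the number of free worker slots when reconciliation runs.
function planWithCapacity(eligible: string[], remaining: number): CapacityPlan {
  const slots = Math.max(0, Math.floor(remaining));
  const launch = eligible.slice(0, slots);
  const queued = eligible.slice(launch.length);

  // Every WU that did not fit is marked as blocked by capacity.
  const launchBlockedBy: Record<string, string[]> = {};
  for (const id of queued) launchBlockedBy[id] = ['capacity'];

  const message = `Queued until worker capacity frees (remaining capacity: ${slots}).`;
  const nextSafeActions: WaitAction[] = [];
  if (queued.length > 0) {
    if (slots === 0) {
      // No capacity at all: one aggregate wait action naming everything queued.
      nextSafeActions.push({ type: 'wait', wuIds: queued, message });
    } else {
      // Otherwise one wait action per queued WU.
      for (const id of queued) nextSafeActions.push({ type: 'wait', wuIds: [id], message });
    }
  }
  return { launch, launchBlockedBy, nextSafeActions };
}
```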

Integrity: contamination detection

Main-checkout contamination occurs when a delegated WU’s declared code_paths overlap with files that are dirty in the root checkout rather than in the WU’s worktree. That overlap is a strong signal that something was edited outside the worktree boundary, which is the single most common source of lost work.

On every orchestrate:initiative -c and orchestrate:init-status call, LumenFlow runs git status --porcelain=v1 in the project root and intersects the resulting dirty set with each WU’s code_paths.
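For readers unfamiliar with the porcelain format, here is a rough sketch of turning git status --porcelain=v1 output into a dirty set. The parsePorcelainV1 function is hypothetical and simplified: it does not unquote paths that Git wraps in double quotes (for example, paths containing spaces), and a file name that itself contains " -> " would be split incorrectly.

```typescript
// Each porcelain v1 line is two status characters, a space, then the path.
// Renames and copies are reported as "old -> new".
function parsePorcelainV1(output: string): string[] {
  const dirty = new Set<string>();
  for (const line of output.split(/\r?\n/)) {
    if (line.length < 4) continue; // skip blank or truncated lines
    const rest = line.slice(3);
    // Count both sides of a rename as dirty.
    for (const path of rest.split(' -> ')) dirty.add(path);
  }
  return Array.from(dirty);
}
```

The returned list is what gets intersected with each WU's code_paths.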

If any overlap is found:

The WU’s orchestration_state becomes contaminated.
status.json sets blocked_by_integrity: true.
The reconciler’s first next_safe_action is recover_wu for the contaminated WU.
buildCheckpointWave refuses to advance: even eligible waves that are logically independent are queued behind recover_wu.

Directory-level matching

A code_path that ends in / is treated as a directory. Any dirty file under that directory counts as contamination of that code_path:

# WU YAML
code_paths:
  - packages/@lumenflow/cli/src/ # directory
  - packages/@lumenflow/mcp/src/tools/orchestration-contract.ts # single file
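The directory rule amounts to a prefix test, while a single-file entry is compared for equality. The sketch below assumes exact string matching with no glob support; matchesCodePath and findContamination are hypothetical names, and the dirty file in the usage line is made up for illustration.

```typescript
// A code_path ending in "/" is a directory: any dirty file under it matches.
// Anything else is treated as a single file and must match exactly.
function matchesCodePath(dirtyFile: string, codePath: string): boolean {
  return codePath.endsWith('/')
    ? dirtyFile.startsWith(codePath)
    : dirtyFile === codePath;
}

// Return the dirty files that overlap with a WU's declared code_paths.
function findContamination(dirtyFiles: string[], codePaths: string[]): string[] {
  return dirtyFiles.filter((file) =>
    codePaths.some((codePath) => matchesCodePath(file, codePath)),
  );
}

// Usage with the code_paths from the WU YAML above.
const overlap = findContamination(
  ['packages/@lumenflow/cli/src/index.ts', 'README.md'],
  [
    'packages/@lumenflow/cli/src/',
    'packages/@lumenflow/mcp/src/tools/orchestration-contract.ts',
  ],
);
```

Keeping the trailing slash in the prefix means a sibling directory such as packages/@lumenflow/cli/src-old/ does not match by accident.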

Integrity: stall detection

A stalled worktree is one that still exists on disk but has produced no activity (checkpoint, signal, or claim update) within the stall threshold. The default threshold is 4 hours; override it in workspace.yaml:

software_delivery:
  orchestration:
    stall_threshold_hours: 8 # allow longer-running worktrees

A WU with no worktree on disk is never stalled — that is the needs_relaunch branch instead.

Activity signals considered

The detector checks the following sources, in this order, and uses the most recent timestamp it finds:

The latest shared-memory checkpoint for the WU.
The latest signal broadcast for the WU.
The WU’s claimed_at timestamp from its YAML spec.

If all three are older than the threshold and a worktree still exists, the WU transitions to stalled with a recover_wu action. A human or agent can then inspect the worktree, merge in-flight work, or block the WU with a concrete reason.
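The stall rule reduces to comparing the newest activity timestamp with the threshold. The following is a sketch under stated assumptions: ActivitySnapshot and isStalled are hypothetical names, and treating a WU with no timestamps at all as stalled is a guess about an edge case the page does not cover.

```typescript
// Hypothetical snapshot of the three activity signals for one WU.
interface ActivitySnapshot {
  worktreeExists: boolean;
  lastCheckpointAt?: Date;
  lastSignalAt?: Date;
  claimedAt?: Date;
}

const DEFAULT_STALL_THRESHOLD_HOURS = 4;

function isStalled(
  snapshot: ActivitySnapshot,
  now: Date,
  thresholdHours: number = DEFAULT_STALL_THRESHOLD_HOURS,
): boolean {
  // No worktree on disk is never "stalled"; that is the needs_relaunch branch.
  if (!snapshot.worktreeExists) return false;

  const timestamps = [snapshot.lastCheckpointAt, snapshot.lastSignalAt, snapshot.claimedAt]
    .filter((d): d is Date => d !== undefined)
    .map((d) => d.getTime());

  // Assumption: with no recorded activity at all, treat the WU as stalled.
  if (timestamps.length === 0) return true;

  // Only the most recent signal matters; stalled means all three are too old.
  const latest = Math.max(...timestamps);
  return now.getTime() - latest > thresholdHours * 60 * 60 * 1000;
}
```

Because only the newest signal counts, a single fresh checkpoint is enough to keep a WU out of the stalled state even when its claimed_at is days old.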

Putting it together

With both detectors wired in, reconciliation stops advancing whenever integrity is suspect, even if there is free capacity and dependencies are clear:

$ pnpm orchestrate:initiative -i INIT-056 -c

[orchestrate:initiative] Loaded 7 WU(s)
Progress: [██████████████░░░░░░] 70%
  Done: 4/7
  Active: 1
  Pending: 0
  Blocked: 2   # contaminated + stalled

Wave -1 manifest: null
Available capacity: 2

Next Safe Actions:
  - recover_wu WU-2634: main checkout contamination detected
  - recover_wu WU-2635: delegated work appears stalled

Unsafe to advance while orchestration integrity issues remain.

The orchestrator will keep refusing to launch new work until those two recover_wu actions are resolved.
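The gating behaviour in this output can be sketched as follows. All names here (WuStatus, ReconcileResult, reconcile, and the 'active' state) are hypothetical; the two reason strings are taken from the sample output above, and treating stalled as an integrity issue alongside contaminated is inferred from that output rather than stated outright.

```typescript
type OrchestrationState =
  | 'active' // hypothetical placeholder for a healthy WU
  | 'contaminated'
  | 'stalled'
  | 'needs_relaunch'
  | 'queued_by_capacity';

interface WuStatus {
  id: string;
  orchestrationState: OrchestrationState;
}

interface ReconcileResult {
  blockedByIntegrity: boolean;
  nextSafeActions: string[];
  launch: string[];
}

// States that require recovery, with the reason shown to the operator.
const RECOVERY_REASONS: Partial<Record<OrchestrationState, string>> = {
  contaminated: 'main checkout contamination detected',
  stalled: 'delegated work appears stalled',
};

function reconcile(wus: WuStatus[], eligible: string[], remaining: number): ReconcileResult {
  const nextSafeActions: string[] = [];
  for (const wu of wus) {
    const reason = RECOVERY_REASONS[wu.orchestrationState];
    if (reason !== undefined) nextSafeActions.push(`recover_wu ${wu.id}: ${reason}`);
  }
  const blockedByIntegrity = nextSafeActions.length > 0;
  // Integrity issues override free capacity: nothing launches until they clear.
  const launch = blockedByIntegrity ? [] : eligible.slice(0, Math.max(0, remaining));
  return { blockedByIntegrity, nextSafeActions, launch };
}
```

The integrity check runs before capacity is consulted, which is why the sample run reports an available capacity of 2 yet launches nothing.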

State machine — where contaminated, stalled, needs_relaunch, and queued_by_capacity sit in the vocabulary.
Lifecycle — the reconciliation phase that fires these detectors.
Control-plane SDK — the shared types that record capacity + integrity state.