Stage 0 — DAgger collection (limb)

Collect rollouts in which a served pi0.5 (Stage 2) or pi0.6 (Stage 3 / 6) policy drives the YAM autonomously and the operator intervenes via bilateral teleop. Every frame is timestamped and tagged with the current phase (autonomous / paused / correcting), and every episode is marked as a success or a failure.

Prerequisites

A served policy on 0.0.0.0:8111 (see Stage 2 / Stage 6 for serving commands).
All four YAM arms initialized cleanly (per the setup diagnostics).
The iKKEGOL pedal connected, and no leftover orphans from a prior crashed run in the output of pgrep -fa multiprocessing (CAN devices can’t be shared across processes, so an orphan that still holds one blocks the new run).
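The first and third prerequisites can be checked from a script before launching. The sketch below is a minimal preflight, not part of limb: the function names are made up for illustration, and it assumes the policy server runs on the same machine (so 127.0.0.1:8111 is reachable) and that a procps-style pgrep is installed.

```python
import socket
import subprocess


def policy_port_open(host: str = "127.0.0.1", port: int = 8111,
                     timeout_s: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def leftover_multiprocessing_procs() -> list[str]:
    """Return the lines printed by `pgrep -fa multiprocessing` (empty if none)."""
    result = subprocess.run(
        ["pgrep", "-fa", "multiprocessing"],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits with status 1 when nothing matches
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

A launch wrapper could refuse to start when policy_port_open() is False or leftover_multiprocessing_procs() is non-empty. A successful TCP connect only shows that something is listening on the port, not that it is the expected policy.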

The command

cd ~/limb
source .venv/bin/activate
uv run limb record --config-path \
  configs/yam_dagger_pi0_bimanual.yaml \
  configs/dagger_collection.yaml

Two YAML overlays:

configs/yam_dagger_pi0_bimanual.yaml — DAgger agent (phase machine wrapping the pi0.x policy, bilateral teleop in CORRECTING) + 4-arm robot config + 3 RealSense cameras.
configs/dagger_collection.yaml — Recording session (continuous per-frame recording, keyboard trigger for the episode lifecycle, num_episodes: 100, episode_duration_s: 200).
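How limb combines the two files is not spelled out here. A common convention for layered configs is a deep merge in which later files win on conflicting keys; the sketch below illustrates that convention only. The agent and session nesting is hypothetical; only initial_phase, num_episodes and episode_duration_s come from this page.

```python
from copy import deepcopy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict: nested mappings are merged, other values are replaced."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the two parsed YAML files.
robot_and_agent = {"agent": {"initial_phase": "paused"}}
recording_session = {"session": {"num_episodes": 100, "episode_duration_s": 200}}

config = deep_merge(robot_and_agent, recording_session)
```

Under this assumed rule the two files touch disjoint keys, so their order only matters if a later overlay redefines something.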

Phase machine (foot pedal)

The DAgger agent has three phases, transitioned by the pedal:

              left pedal              right pedal
   AUTONOMOUS <───────────> PAUSED <───────────> CORRECTING
   (policy drives)           (everyone holds)    (operator drives
                                                  via leaders;
                                                  followers track)

You cannot go AUTONOMOUS↔CORRECTING directly — always through PAUSED. This guarantees the operator is positioned before control changes hands.

Pedal	Effect (from current phase)
Left	`AUTONOMOUS` ↔ `PAUSED`
Right	`PAUSED` ↔ `CORRECTING`

The agent boots in initial_phase: paused so the policy doesn’t start moving the moment the stack launches.
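The transition rules above fit in a small lookup table. This is an illustrative sketch, not limb’s implementation: the Phase, Pedal and next_phase names are invented here, and it assumes a pedal press with no listed transition (for example the right pedal while AUTONOMOUS) is simply ignored.

```python
from enum import Enum


class Phase(Enum):
    AUTONOMOUS = "autonomous"
    PAUSED = "paused"
    CORRECTING = "correcting"


class Pedal(Enum):
    LEFT = "left"
    RIGHT = "right"


# Only these four moves exist; every path between AUTONOMOUS and CORRECTING
# has to pass through PAUSED.
TRANSITIONS = {
    (Phase.AUTONOMOUS, Pedal.LEFT): Phase.PAUSED,
    (Phase.PAUSED, Pedal.LEFT): Phase.AUTONOMOUS,
    (Phase.PAUSED, Pedal.RIGHT): Phase.CORRECTING,
    (Phase.CORRECTING, Pedal.RIGHT): Phase.PAUSED,
}


def next_phase(phase: Phase, pedal: Pedal) -> Phase:
    """Apply one pedal press. Assumption: unlisted presses leave the phase unchanged."""
    return TRANSITIONS.get((phase, pedal), phase)


def replay(presses, initial: Phase = Phase.PAUSED) -> list:
    """Return the sequence of phases visited, starting from the boot phase."""
    history = [initial]
    for pedal in presses:
        history.append(next_phase(history[-1], pedal))
    return history
```

For example, replay([Pedal.LEFT, Pedal.LEFT, Pedal.RIGHT, Pedal.RIGHT, Pedal.LEFT]) walks through a launch, a pause, a correction and a handback to the policy.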

Episode lifecycle (keyboard)

A separate keyboard trigger drives the recording session — episode start / save / discard. This avoids contention with the pedal (the pedal trigger grabs the iKKEGOL exclusively).

Key	Effect
`SPACE` (between episodes)	start the next episode (recording begins)
`s`	end the current episode, save with `SUCCESS` marker
`SPACE` (during episode)	end the current episode, save with `FAILURE` marker (policy missed)
`d`	discard the current episode (delete from disk)
`q`	quit the session
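The key table can be read as a two-state machine (between episodes / recording) in which SPACE means different things depending on the state. The sketch below models only that logic; the class is hypothetical and does not touch limb or the disk. It assumes s and d are ignored between episodes and that q quits from either state, which this page does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingSession:
    recording: bool = False
    running: bool = True
    events: list[str] = field(default_factory=list)

    def handle_key(self, key: str) -> None:
        if not self.running:
            return
        if key == "q":
            self.running = False
            self.events.append("quit")
        elif not self.recording:
            # Between episodes only SPACE does anything (assumption for s / d).
            if key == " ":
                self.recording = True
                self.events.append("start")
        elif key == "s":
            self._end("save:SUCCESS")
        elif key == " ":
            self._end("save:FAILURE")
        elif key == "d":
            self._end("discard")

    def _end(self, event: str) -> None:
        self.recording = False
        self.events.append(event)
```

The overloaded SPACE is the part to watch: an accidental double press while recording ends the episode with a FAILURE marker.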

A typical episode:

(Between episodes) — stage the scene; SPACE to start recording.
The robot starts in PAUSED. Optionally press the right pedal to enter CORRECTING, drive the followers to the start pose via the leaders, then press the right pedal again to return to PAUSED.
Left pedal → AUTONOMOUS; policy attempts the task.
If it drifts: left pedal (→ PAUSED), then right pedal (→ CORRECTING); teleop the arms back on-task via the leaders; then right pedal (→ PAUSED) and left pedal (→ AUTONOMOUS).
Press s if the policy completed the task, SPACE if it failed.

What gets recorded

Per episode under recordings/<task>_<ts>/episode_<...>/:

File	What it is
`{arm}_actions.npz` (`pos`)	commanded action that frame (operator’s during CORRECTING)
`{arm}_states.npz`	observed state (joints + gripper)
`{arm}_policy_actions.npz` (`pos`)	what the policy would have produced (shadow stream)
`{cam}.mp4` + `{cam}_timestamps.npy`	per-camera video + timestamps
`phase.npy`	per-tick phase string (`autonomous`/`paused`/`correcting`)
`interventions.npy`	per-tick `intervention` bool (legacy column)
`correction_index.npy`	per-tick id grouping consecutive CORRECTING frames
`timestamps.npy`	per-tick control-loop timestamp
`SUCCESS` or `FAILURE`	the marker you wrote with `s` / `SPACE`
`metadata.json`	task instruction, arm names, cameras, etc.
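A quick way to sanity-check an episode directory is to load these files and summarize them. The sketch below uses NumPy and is not part of limb; summarize_episode is a made-up name. It assumes phase.npy is a plain string array (an object-dtype array would need allow_pickle=True) and that the pos arrays in {arm}_actions.npz and {arm}_policy_actions.npz have one row per tick, aligned with phase.npy.

```python
from collections import Counter
from pathlib import Path

import numpy as np


def summarize_episode(episode_dir) -> dict:
    ep = Path(episode_dir)
    phase = np.load(ep / "phase.npy")  # assumption: fixed-width string dtype

    marker = None
    for name in ("SUCCESS", "FAILURE"):
        if (ep / name).exists():
            marker = name

    summary = {
        "ticks": int(phase.shape[0]),
        "phase_counts": dict(Counter(phase.tolist())),
        "marker": marker,
        "correction_gap": {},
    }

    correcting = phase == "correcting"
    for path in sorted(ep.glob("*_actions.npz")):
        arm = path.name[: -len("_actions.npz")]
        if arm.endswith("_policy"):
            continue  # the shadow stream is loaded alongside its arm below
        with np.load(path) as data:
            commanded = data["pos"]
        with np.load(ep / f"{arm}_policy_actions.npz") as data:
            shadow = data["pos"]
        # Mean |operator action - policy shadow action| over CORRECTING ticks.
        gap = np.abs(commanded - shadow)[correcting]
        summary["correction_gap"][arm] = float(gap.mean()) if gap.size else None
    return summary
```

A correction gap near zero during CORRECTING would suggest the operator and the policy largely agreed on those ticks; a large one marks where the correction actually changed the action.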

policy_action (stored per arm in {arm}_policy_actions.npz; the policy’s shadow output, recorded even during CORRECTING) is the key new stream needed for RECAP. See Stage 1 for how it gets surfaced.

Hygiene rules

Keep both success and failure episodes. RECAP’s value model needs both classes to converge. ~30–60% success is a healthy band for productive correction collection. We’ve observed ~30% in practice on small datasets.
One task instruction per session. Pistar’s value model and percentile labeling are per-task; mixing breaks the percentile.
Label honestly. The SUCCESS/FAILURE marker drives reward, value_label, and downstream advantage signs. A mislabeled episode poisons the value model directly.
Don’t filter “boring” autonomous successes out. Stage 5 will classify those frames as positive for free — they’re useful training signal.
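The first and third rules can be spot-checked by counting markers across a session directory. The sketch below is illustrative only: success_stats is a hypothetical helper, it assumes episode folders match episode_* under the session directory, and it treats the 30–60% band above as a default rather than a hard rule.

```python
from pathlib import Path


def success_stats(session_dir, band=(0.30, 0.60)) -> dict:
    """Count SUCCESS / FAILURE markers under session_dir/episode_*."""
    successes = failures = 0
    unlabeled = []
    for ep in sorted(Path(session_dir).glob("episode_*")):
        has_success = (ep / "SUCCESS").exists()
        has_failure = (ep / "FAILURE").exists()
        if has_success and not has_failure:
            successes += 1
        elif has_failure and not has_success:
            failures += 1
        else:
            # Missing or contradictory markers need a human look.
            unlabeled.append(ep.name)

    labeled = successes + failures
    rate = successes / labeled if labeled else None
    return {
        "successes": successes,
        "failures": failures,
        "unlabeled": unlabeled,
        "success_rate": rate,
        "both_classes": successes > 0 and failures > 0,
        "in_band": rate is not None and band[0] <= rate <= band[1],
    }
```

Anything listed under unlabeled is worth resolving before conversion, since the marker is what downstream labels are derived from.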

Scale guidance (from the pi0.6 paper)

The π★₀.₆ paper, Appendix A-F, reports per-task collection counts. It gives no scaling curve and no explicit “enough” criterion; the counts it reports are:

Task	Demos	Autonomous	Correction episodes	Iterations
Laundry (diverse)	—	450	287	one set
Espresso / café	—	414	429	per iter
Box assembly	600	—	360	× 2 iters

For a small-scale smoke test the Stage 3 LoRA path runs on ~10 episodes; for genuine RECAP improvement at paper scale aim for ~300 correction episodes per iteration.

A practical stopping signal: when the latest batch of episodes shows an intervention rate below 10% (the policy is succeeding on its own), stop collecting and start a new iteration.
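The stopping check is a one-line ratio once “intervention rate” is pinned down. The sketch below assumes it means the fraction of ticks flagged true in interventions.npy, pooled over the batch; this page does not define it, so treat that as an assumption. The function names are illustrative.

```python
def intervention_rate(episodes) -> float:
    """Fraction of ticks flagged as intervention, pooled over all episodes.

    `episodes` is a sequence of per-episode boolean sequences, for example
    the arrays loaded from each episode's interventions.npy.
    """
    total = sum(len(ep) for ep in episodes)
    if total == 0:
        raise ValueError("no ticks in batch")
    flagged = sum(sum(bool(x) for x in ep) for ep in episodes)
    return flagged / total


def should_stop(latest_batch, threshold: float = 0.10) -> bool:
    """True when the pooled intervention rate is below the threshold."""
    return intervention_rate(latest_batch) < threshold
```

Pooling over ticks weights long episodes more heavily than short ones; averaging per-episode rates instead is an equally defensible reading and may give a different answer on small batches.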

Reference dataset

The reference dataset shipped with this site is at datasets/vial_rollout_v1_v21/ (10 episodes, 21,286 frames @ 30 fps, 3 SUCCESS / 7 FAILURE, 32.7% intervention rate). It’s enough to validate the full pipeline; it is not enough to produce a RECAP-improved policy.
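The headline numbers can be turned into more intuitive quantities with a little arithmetic. The snippet below uses only the figures quoted above and assumes the intervention rate is a per-frame fraction.

```python
FRAMES = 21_286
FPS = 30
EPISODES = 10
SUCCESSES = 3
INTERVENTION_RATE = 0.327  # assumption: fraction of frames, as quoted above

total_s = FRAMES / FPS                 # about 709.5 s, just under 12 minutes
mean_episode_s = total_s / EPISODES    # about 71 s per episode
success_rate = SUCCESSES / EPISODES    # 0.3
intervention_frames = round(INTERVENTION_RATE * FRAMES)  # about 6,961 frames

print(f"total: {total_s:.1f} s, mean episode: {mean_episode_s:.1f} s")
print(f"success rate: {success_rate:.0%}, intervention frames: ~{intervention_frames}")
```

So the whole dataset is roughly twelve minutes of robot time, with a mean episode of about 71 s, well under the episode_duration_s: 200 cap in configs/dagger_collection.yaml.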

Next

Convert the raw episodes → Stage 1.