Stage 0 — DAgger collection (limb)
Collect rollouts where a served pi0.5 (Stage 2) or pi0.6 (Stage 3 / 6)
policy drives the YAM autonomously and the operator intervenes via
bilateral teleop. Every frame is timestamped and tagged with the
current phase (autonomous / paused / correcting) and per-episode
success/failure.
Prerequisites
- A served policy on `0.0.0.0:8111` (see Stage 2 / Stage 6 for serving commands).
- All four YAM arms initialized cleanly (per the setup diagnostics).
- The iKKEGOL pedal connected.
- `pgrep -fa multiprocessing` showing no leftover orphans from a prior crashed run (CAN devices can’t be shared across processes).
The command
```bash
cd ~/limb
source .venv/bin/activate
uv run limb record --config-path \
  configs/yam_dagger_pi0_bimanual.yaml \
  configs/dagger_collection.yaml
```
Two YAML overlays:
- `configs/yam_dagger_pi0_bimanual.yaml`: DAgger agent (phase machine wrapping the pi0.x policy, bilateral teleop in CORRECTING) + 4-arm robot config + 3 RealSense cameras.
- `configs/dagger_collection.yaml`: recording session (continuous per-frame recording, keyboard trigger for the episode lifecycle, `num_episodes: 100`, `episode_duration_s: 200`).
Phase machine (foot pedal)
The DAgger agent has three phases, transitioned by the pedal:
```text
            left pedal           right pedal
AUTONOMOUS <───────────> PAUSED <───────────> CORRECTING
(policy drives)     (everyone holds)          (operator drives
                                              via leaders;
                                              followers track)
```
You cannot go AUTONOMOUS↔CORRECTING directly — always through PAUSED. This guarantees the operator is positioned before control changes hands.
| Pedal | Effect (from current phase) |
|---|---|
| Left | toggles AUTONOMOUS ↔ PAUSED |
| Right | toggles PAUSED ↔ CORRECTING |
The agent boots in initial_phase: paused so the policy doesn’t start
moving the moment the stack launches.
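The pedal logic can be sketched as a small transition table. This is a minimal Python sketch, assuming the left pedal only toggles AUTONOMOUS ↔ PAUSED, the right pedal only toggles PAUSED ↔ CORRECTING, and every other press is ignored; `Phase`, `PhaseMachine`, and `TRANSITIONS` are hypothetical names, not limb’s actual classes.

```python
from enum import Enum


class Phase(Enum):
    AUTONOMOUS = "autonomous"
    PAUSED = "paused"
    CORRECTING = "correcting"


# Hypothetical transition table: (current phase, pedal) -> next phase.
# Any pair not listed is treated as a no-op (an assumption), which is what
# rules out direct AUTONOMOUS <-> CORRECTING jumps.
TRANSITIONS = {
    (Phase.AUTONOMOUS, "left"): Phase.PAUSED,
    (Phase.PAUSED, "left"): Phase.AUTONOMOUS,
    (Phase.PAUSED, "right"): Phase.CORRECTING,
    (Phase.CORRECTING, "right"): Phase.PAUSED,
}


class PhaseMachine:
    def __init__(self, initial_phase: Phase = Phase.PAUSED) -> None:
        # Boot in PAUSED so the policy does not move when the stack launches.
        self.phase = initial_phase

    def press(self, pedal: str) -> Phase:
        # Unlisted (phase, pedal) pairs leave the phase unchanged.
        self.phase = TRANSITIONS.get((self.phase, pedal), self.phase)
        return self.phase


# Replay the drift-recovery sequence from a typical episode.
m = PhaseMachine()
for pedal in ["left", "left", "right", "right", "left"]:
    print(pedal, "->", m.press(pedal).name)
```

Routing every change of control through PAUSED is what gives the operator a moment to get positioned before the followers start tracking the leaders.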
Episode lifecycle (keyboard)
A separate keyboard trigger drives the recording session — episode start / save / discard. This avoids contention with the pedal (the pedal trigger grabs the iKKEGOL exclusively).
| Key | Effect |
|---|---|
| `SPACE` | start the next episode (recording begins) |
| | end the current episode, save with |
| | end the current episode, save with |
| | discard the current episode (delete from disk) |
| | quit the session |
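The episode lifecycle is a second state machine, independent of the pedal phases. A minimal sketch, assuming only saved episodes count toward `num_episodes` and that discards are simply tallied; `Session`, `start`, `end`, and `MARKERS` are hypothetical names, and the key bindings themselves are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MARKERS = ("SUCCESS", "FAILURE")


@dataclass
class Session:
    """Hypothetical model of the keyboard-driven episode lifecycle."""
    num_episodes: int = 100
    recording: bool = False
    saved: List[str] = field(default_factory=list)  # outcome marker per kept episode
    discarded: int = 0

    def start(self) -> bool:
        # Assumption: only saved episodes count toward num_episodes.
        if self.recording or len(self.saved) >= self.num_episodes:
            return False
        self.recording = True
        return True

    def end(self, marker: Optional[str]) -> None:
        """Close the current episode: save it with a marker, or discard it if marker is None."""
        if not self.recording:
            return
        if marker is not None and marker not in MARKERS:
            raise ValueError(f"unknown marker: {marker!r}")
        self.recording = False
        if marker is None:
            self.discarded += 1  # the real tool deletes the episode from disk
        else:
            self.saved.append(marker)


s = Session(num_episodes=2)
s.start(); s.end("SUCCESS")
s.start(); s.end(None)
s.start(); s.end("FAILURE")
print(s.saved, s.discarded)
```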
A typical episode:
1. (Between episodes) Stage the scene; press `SPACE` to start recording.
2. The robot is in PAUSED. Optionally right pedal → CORRECTING, drive the followers to the start pose via the leaders, then right pedal back to PAUSED.
3. Left pedal → AUTONOMOUS; the policy attempts the task.
4. If it drifts: left pedal (PAUSED) → right pedal (CORRECTING), bilaterally teleop back on-task, right pedal (PAUSED) → left pedal (AUTONOMOUS).
5. Press `s` if the policy completed the task, `SPACE` if it failed.
What gets recorded
Per episode under recordings/<task>_<ts>/episode_<...>/:
| File | What it is |
|---|---|
| | commanded action that frame (operator’s during CORRECTING) |
| | observed state (joints + gripper) |
| | what the policy would have produced (shadow stream) |
| | per-camera video + timestamps |
| | per-tick phase string (autonomous / paused / correcting) |
| | per-tick |
| | per-tick id grouping consecutive CORRECTING frames |
| | per-tick control-loop timestamp |
| | the marker you wrote with |
| | task instruction, arm names, cameras, etc. |
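The intervention-id stream can be derived from the per-tick phase strings. A minimal sketch, assuming phase strings are the lowercase `autonomous` / `paused` / `correcting` values and that ticks outside any intervention get a `-1` sentinel; `intervention_ids` and `NOT_CORRECTING` are hypothetical, not limb’s on-disk format.

```python
from typing import List

NOT_CORRECTING = -1  # hypothetical sentinel for ticks outside any intervention


def intervention_ids(phases: List[str]) -> List[int]:
    """Give each run of consecutive "correcting" ticks its own id (0, 1, 2, ...)."""
    ids: List[int] = []
    next_id = 0
    current = NOT_CORRECTING
    prev_correcting = False
    for phase in phases:
        correcting = phase == "correcting"
        if correcting and not prev_correcting:
            # A new CORRECTING run starts on this tick.
            current = next_id
            next_id += 1
        ids.append(current if correcting else NOT_CORRECTING)
        prev_correcting = correcting
    return ids


print(intervention_ids(["autonomous", "paused", "correcting", "correcting",
                        "paused", "correcting", "paused"]))
# → [-1, -1, 0, 0, -1, 1, -1]
```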
`policy_action` (the policy’s shadow output, recorded even during CORRECTING) is
the key new stream needed for RECAP; see
Stage 1 for how it gets surfaced.
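To show how the commanded `action` and the shadow `policy_action` diverge, here is a sketch of one control tick. It assumes the policy is queried on every tick regardless of phase, and that PAUSED is implemented by commanding the current observed state (a hold). `control_tick`, `TickRecord`, and the toy policy are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class TickRecord:
    phase: str
    action: Vector         # what was actually commanded to the followers
    policy_action: Vector  # shadow stream: the policy's proposal this tick
    state: Vector          # observed joints + gripper


def control_tick(phase: str, state: Vector, leader: Vector,
                 policy: Callable[[Vector], Vector]) -> TickRecord:
    """One hypothetical control step; the policy is queried in every phase."""
    proposed = list(policy(state))
    if phase == "autonomous":
        action = list(proposed)  # the policy drives
    elif phase == "correcting":
        action = list(leader)    # followers track the operator's leaders
    elif phase == "paused":
        action = list(state)     # hold in place (assumed implementation)
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    return TickRecord(phase, action, proposed, list(state))


def nudge(state: Vector) -> Vector:
    # Toy policy: push every joint by +1.0.
    return [x + 1.0 for x in state]


rec = control_tick("correcting", [0.0, 0.5], [0.25, 0.75], nudge)
print(rec.action, rec.policy_action)  # → [0.25, 0.75] [1.0, 1.5]
```

During CORRECTING the two streams differ, and that difference (what the operator did versus what the policy would have done) is exactly what a shadow stream preserves for later stages.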
Hygiene rules
- Keep both success and failure episodes. RECAP’s value model needs both classes to converge. A success rate of ~30–60% is a healthy band for productive correction collection; we’ve observed ~30% in practice on small datasets.
- One task instruction per session. Pistar’s value model and percentile labeling are per-task; mixing tasks in one session breaks the percentile.
- Label honestly. The SUCCESS/FAILURE marker drives `reward`, `value_label`, and downstream advantage signs, so a mislabeled episode poisons the value model directly.
- Don’t filter “boring” autonomous successes out. Stage 5 will classify those frames as `positive` for free; they’re useful training signal.
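The first two hygiene rules can be checked mechanically before handing a session to Stage 1. A minimal sketch, assuming each episode’s metadata is available as a dict with hypothetical `task` and `outcome` keys; `check_session` is not part of limb.

```python
from typing import Dict, List, Tuple


def check_session(episodes: List[Dict[str, str]],
                  band: Tuple[float, float] = (0.30, 0.60)) -> Dict[str, object]:
    """Enforce one task per session and report whether the success rate is in band.

    Each episode is a dict with hypothetical keys "task" (instruction string)
    and "outcome" ("SUCCESS" or "FAILURE").
    """
    tasks = {ep["task"] for ep in episodes}
    if len(tasks) != 1:
        # Mixed (or missing) instructions would break per-task percentile labeling.
        raise ValueError(f"expected exactly one task instruction, got {sorted(tasks)}")
    successes = sum(ep["outcome"] == "SUCCESS" for ep in episodes)
    rate = successes / len(episodes)
    low, high = band
    return {"task": next(iter(tasks)), "success_rate": rate,
            "in_band": low <= rate <= high}


eps = ([{"task": "pick vial", "outcome": "SUCCESS"}] * 3
       + [{"task": "pick vial", "outcome": "FAILURE"}] * 7)
print(check_session(eps))
```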
Scale guidance (from the pi0.6 paper)
The π★₀.₆ paper, Appendix A-F, reports per-task collection counts. There is no scaling curve and no explicit “enough” criterion; what they actually did:
| Task | Demos | Autonomous episodes | Correction episodes | Iterations |
|---|---|---|---|---|
| Laundry (diverse) | — | 450 | 287 | one set |
| Espresso / café | — | 414 | 429 | per iter |
| Box assembly | 600 | — | 360 | × 2 iters |
For a small-scale smoke test, the Stage 3 LoRA path runs on ~10 episodes; for genuine RECAP improvement at paper scale, aim for ~300 correction episodes per iteration.
A practical stopping signal: when the latest batch of episodes shows an intervention rate below 10% (the policy is succeeding on its own), stop collecting and start a new iteration.
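The stopping signal can be expressed as a small check over the latest batch. This sketch assumes a frame-level definition of intervention rate (the fraction of ticks in CORRECTING, pooled across the batch), which the text above does not pin down; `intervention_rate` and `should_stop` are hypothetical helpers.

```python
from typing import List


def intervention_rate(phases: List[str]) -> float:
    """Fraction of ticks spent in CORRECTING (assumed frame-level definition)."""
    if not phases:
        return 0.0
    return sum(p == "correcting" for p in phases) / len(phases)


def should_stop(batch: List[List[str]], threshold: float = 0.10) -> bool:
    """True when the pooled intervention rate of the latest batch is below threshold.

    `batch` holds one per-tick phase list per episode (hypothetical layout).
    """
    ticks = [p for episode in batch for p in episode]
    if not ticks:
        raise ValueError("latest batch has no recorded ticks")
    return intervention_rate(ticks) < threshold


# Two episodes with 5% and 15% CORRECTING ticks pool to exactly 10%,
# which is not strictly below the threshold, so keep collecting.
ep_a = ["autonomous"] * 95 + ["correcting"] * 5
ep_b = ["autonomous"] * 85 + ["correcting"] * 15
print(should_stop([ep_a, ep_b]))  # → False
```

Pooling ticks weights long episodes more heavily; averaging per-episode rates instead would weight every episode equally, which may matter when episode lengths vary a lot.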
Reference dataset
The reference dataset shipped with this site is at
datasets/vial_rollout_v1_v21/ (10 episodes, 21,286 frames @ 30 fps,
3 SUCCESS / 7 FAILURE, 32.7% intervention rate). It’s enough to
validate the full pipeline; it is not enough to produce a
RECAP-improved policy.
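Some quick arithmetic on the reference dataset gives a sense of scale. The last figure assumes the 32.7% intervention rate is frame-level, which the text does not state.

```python
# Reference dataset figures as quoted above.
FRAMES = 21_286
FPS = 30
EPISODES = 10
SUCCESSES = 3
INTERVENTION_RATE = 0.327  # assumed to be a fraction of frames

total_s = FRAMES / FPS               # ~709.5 s of recording
total_min = total_s / 60             # ~11.8 min
per_episode_s = total_s / EPISODES   # ~71 s per episode on average
success_rate = SUCCESSES / EPISODES  # 0.3, the low edge of the 30-60% band
correcting_frames = round(FRAMES * INTERVENTION_RATE)  # ~6,961 if frame-level

print(f"{total_s:.1f} s total ({total_min:.1f} min), "
      f"{per_episode_s:.1f} s/episode, success {success_rate:.0%}, "
      f"~{correcting_frames} CORRECTING frames")
```

That is roughly 12 minutes of data in total and ~71 s per episode, well under the 200 s `episode_duration_s` cap.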
Next
Convert the raw episodes → Stage 1.