RECAP on YAM
End-to-end documentation for our RECAP (RL with Experience and Corrections via Advantage-conditioned Policies) implementation on YAM bimanual arms. RECAP is the offline RL algorithm in pi0.6 (π★₀.₆: a VLA That Learns From Experience, Physical Intelligence et al.).
This site documents the full pipeline we actually run on real hardware:
- Data collection in `limb`: DAgger sessions with a three-state phase machine (AUTONOMOUS / PAUSED / CORRECTING) and an operator-driven episode lifecycle.
- Data conversion via `limb convert-lerobot --pistar`: produces a LeRobot v3.0 dataset with the five RECAP columns (`intervention`, `reward`, `reward_label`, `value_label`, `adv_ind`).
- Training in pistar (JAX): six stages from SFT through full RECAP, with 15 patches we wrote on top of upstream pistar to make Stages 4 + 5 (VLM value model + VLM advantage labeling) actually runnable.
- Evaluation through `openpi/scripts/serve_policy.py` and limb's `OpenPIClient`: the trained pi0.6 checkpoint serves through the standard openpi wire protocol with no limb-side changes.
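One natural way for a three-state phase machine to feed the `intervention` column is to flag every frame recorded while the operator is CORRECTING. The sketch below shows that idea. It is a minimal illustration, not limb's code: the names `Phase`, `DaggerSession`, `transition` and `record_frame`, and the transition table itself, are hypothetical. limb's real session class, its allowed transitions, and its handling of PAUSED frames may all differ.

```python
from enum import Enum, auto


class Phase(Enum):
    AUTONOMOUS = auto()  # policy drives the arms
    PAUSED = auto()      # motion halted; operator decides what happens next
    CORRECTING = auto()  # operator teleoperates a correction


# Hypothetical transition table: the policy can only be interrupted via PAUSED,
# while a correction may hand control straight back to the policy.
ALLOWED = {
    Phase.AUTONOMOUS: {Phase.PAUSED},
    Phase.PAUSED: {Phase.AUTONOMOUS, Phase.CORRECTING},
    Phase.CORRECTING: {Phase.PAUSED, Phase.AUTONOMOUS},
}


class DaggerSession:
    """Hypothetical sketch, not limb's API: tracks the phase and tags frames."""

    def __init__(self):
        self.phase = Phase.AUTONOMOUS
        self.intervention = []  # one 0/1 flag per recorded frame

    def transition(self, new_phase):
        if new_phase not in ALLOWED[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.name} -> {new_phase.name}"
            )
        self.phase = new_phase

    def record_frame(self):
        # Only frames produced during an operator correction count as interventions.
        self.intervention.append(1 if self.phase is Phase.CORRECTING else 0)


session = DaggerSession()
session.record_frame()                # policy acting
session.transition(Phase.PAUSED)      # operator pauses the rollout
session.transition(Phase.CORRECTING)  # operator takes over
session.record_frame()
session.record_frame()
session.transition(Phase.AUTONOMOUS)  # hand control back to the policy
session.record_frame()
print(session.intervention)  # [0, 1, 1, 0]
```

Requiring a pass through PAUSED before a correction gives the operator an explicit hand-off point. This is a design assumption of the sketch.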
For the algorithm itself, read the π★₀.₆ paper. For the reference RECAP pipeline structure, see the RLinf RECAP page. RLinf is sim-only (LIBERO), while this site documents a real-robot implementation.
Quick-start path
The shortest path from a fresh checkout to a working pi0.6 checkpoint on YAM. Each link goes to a dedicated page with full commands.
Pipeline stages
- Overview
- Setup
- Stage 0: DAgger collection (limb)
- Stage 1: Convert to LeRobot v2.1 with five RECAP columns
- Stage 2: Initial SFT (openpi, full fine-tune)
- Stage 3: pi0.6 fine-tune from SFT (no VLM yet)
- Stage 4: VLM value model training (pistar, patched)
- Stage 5: Advantage labeling (VLM relabel of `adv_ind`)
- Stage 6: Full RECAP fine-tune
- Evaluation: Serve + deploy via limb
What’s adapted from where
| Component | Origin | Adaptation for YAM |
|---|---|---|
| Algorithm | pi0.6 / RECAP paper | unchanged |
| Code base | ybpy/pistar (JAX fork of openpi) | added YAM |
| Collection stack | limb (YAM control) | added DAgger session lifecycle, 6 pistar-shaped converter helpers, … |
| Policy serving | openpi (`scripts/serve_policy.py`) | unchanged; pi0.6 checkpoints serve through it natively (no CFG-sampler shim needed) |
| VLM value model checkpoint | ybpy/vlm_ckpt (HF) / Google Drive | unchanged |
Why pistar, not RLinf
Both pistar and RLinf implement pi0.6 / RECAP and use the same value
model (SigLIP-So400m + Gemma3-270M + 201-bin C51 head over [-1, 0]).
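A C51-style head predicts a categorical distribution over fixed atoms, here 201 atoms evenly spaced on [-1, 0] with a spacing of 0.005. The pure-Python sketch below covers four pieces: the atom support, the scalar value read-out, the standard projection of a scalar target onto two neighboring atoms, and the cross-entropy training loss. The function names are hypothetical, and the code is not pistar's JAX implementation or RLinf's PyTorch one.

```python
import math

V_MIN, V_MAX, NUM_BINS = -1.0, 0.0, 201
DELTA = (V_MAX - V_MIN) / (NUM_BINS - 1)  # atom spacing: 0.005
ATOMS = [V_MIN + i * DELTA for i in range(NUM_BINS)]


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def expected_value(logits):
    """Scalar value estimate: the mean of the predicted distribution over ATOMS."""
    return sum(math.exp(lp) * z for lp, z in zip(log_softmax(logits), ATOMS))


def project_target(target):
    """Project a scalar target onto ATOMS (two-hot).

    Mass is split between the two atoms bracketing the clipped target, so the
    projected distribution's mean equals that target. For a single Dirac target
    this coincides with the C51 projection step.
    """
    t = min(max(target, V_MIN), V_MAX)
    pos = (t - V_MIN) / DELTA  # fractional bin index in [0, NUM_BINS - 1]
    lo = min(int(math.floor(pos)), NUM_BINS - 1)
    hi = min(lo + 1, NUM_BINS - 1)
    frac = pos - lo
    probs = [0.0] * NUM_BINS
    probs[lo] += 1.0 - frac
    probs[hi] += frac
    return probs


def cross_entropy(logits, target_probs):
    """Training loss: cross-entropy between the projected target and the prediction."""
    return -sum(p * lp for p, lp in zip(target_probs, log_softmax(logits)) if p > 0)
```

Predicting a distribution rather than a single scalar turns value regression into classification over bins. That tends to train more stably when returns are bounded, as they are on [-1, 0].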
The main differences between pistar and RLinf are the labeling pipeline and the validation regime:
| Dimension | RLinf | pistar |
|---|---|---|
| Backend | PyTorch | JAX |
| Relation to openpi | vendors openpi | fork of openpi |
| Validation | LIBERO simulation only | real robot (SO-101, AgileX PiPER) |
| Advantage labeling | quantile from value model, no auxiliary labels | VLM-based |
| Conditioning at serving | CFG sampler required | not required; checkpoints serve natively through openpi |
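The "quantile from value model" labeling in the RLinf column works roughly like the sketch below. Several details are assumptions made for illustration: the per-step rewards, the scalar value estimates `values[t]`, the n-step advantage estimate, and a single global threshold chosen so that roughly 30% of steps get `adv_ind = 1`. The function names and the 30% default are hypothetical. None of this is RLinf's or pistar's code, and pistar uses VLM-based labeling instead.

```python
def n_step_advantages(rewards, values, n):
    """A_t = sum_{k<n} r_{t+k} + V(s_{t+n}) - V(s_t), truncated at episode end.

    `values[t]` is the value model's scalar estimate for step t. Past the last
    step the bootstrap term is 0, i.e. the episode end is treated as terminal
    (an assumption of this sketch).
    """
    T = len(rewards)
    advantages = []
    for t in range(T):
        end = min(t + n, T)
        bootstrap = values[end] if end < T else 0.0
        advantages.append(sum(rewards[t:end]) + bootstrap - values[t])
    return advantages


def quantile(xs, q):
    """q-quantile with linear interpolation between order statistics."""
    s = sorted(xs)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def label_adv_ind(advantages, positive_fraction=0.3):
    """Binary adv_ind: 1 for roughly the top `positive_fraction` of advantages."""
    threshold = quantile(advantages, 1.0 - positive_fraction)
    return [1 if a > threshold else 0 for a in advantages]


advs = n_step_advantages([-0.25] * 4, [-1.0, -1.0, -0.5, -0.25], n=1)
print(advs)  # [-0.25, 0.25, 0.0, 0.0]
```

A quantile threshold fixes the fraction of positive labels by construction, apart from ties, whatever the value model's calibration. A fixed advantage cutoff would not.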
For real-robot YAM, pistar is the closer fit. However, its upstream
`main` branch is broken for Stages 4 + 5. We made those stages work with
15 targeted patches, which are documented on the patches page.