RECAP on YAM

End-to-end documentation for our RECAP (RL with Experience and Corrections via Advantage-conditioned Policies) implementation on YAM bimanual arms. RECAP is the offline RL algorithm in pi0.6 (π★₀.₆: a VLA That Learns From Experience, Physical Intelligence et al.).

This site documents the full pipeline we actually run on real hardware:

Data collection in limb — DAgger sessions with a three-state phase machine (AUTONOMOUS / PAUSED / CORRECTING) and operator-driven episode lifecycle.
Data conversion via limb convert-lerobot --pistar — produces a LeRobot v3.0 dataset with the five RECAP columns (intervention, reward, reward_label, value_label, adv_ind).
Training in pistar (JAX) — six stages from SFT through full RECAP, with 15 patches we wrote on top of upstream pistar to make Stages 4 + 5 (VLM value model + VLM advantage labeling) actually runnable.
Evaluation through openpi/scripts/serve_policy.py and limb’s OpenPIClient — the trained pi0.6 checkpoint serves through the standard openpi wire protocol with no limb-side changes.
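
The three-state phase machine used during collection can be sketched as follows. This is a minimal illustration, not limb's actual code: the phase names come from the list above, but the event names (`pause`, `resume`, `take_over`, `release`) and the transition table are hypothetical assumptions made for the example.

```python
from enum import Enum, auto


class Phase(Enum):
    AUTONOMOUS = auto()  # policy drives the arms
    PAUSED = auto()      # execution halted, operator decides what to do
    CORRECTING = auto()  # operator drives the arms


# Hypothetical transition table: (current phase, operator event) -> next phase.
# The event names are assumptions for this sketch, not limb's API.
TRANSITIONS = {
    (Phase.AUTONOMOUS, "pause"): Phase.PAUSED,
    (Phase.PAUSED, "resume"): Phase.AUTONOMOUS,
    (Phase.PAUSED, "take_over"): Phase.CORRECTING,
    (Phase.CORRECTING, "release"): Phase.AUTONOMOUS,
}


def step(phase, event):
    """Return the next phase; events with no entry leave the phase unchanged."""
    return TRANSITIONS.get((phase, event), phase)


def run(events, start=Phase.AUTONOMOUS):
    """Replay a sequence of operator events and return every phase visited."""
    phases = [start]
    for event in events:
        phases.append(step(phases[-1], event))
    return phases
```

Keeping the transitions in a single table makes it easy to audit which operator actions are legal in each phase; the real session lifecycle may allow more transitions than the four assumed here.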

For the algorithm itself, read the π★₀.₆ paper. For the reference RECAP pipeline structure, see the RLinf RECAP page. RLinf is validated in simulation only (LIBERO), while this site documents a real-robot implementation.

Quick-start path

The shortest path from a fresh checkout to a working pi0.6 checkpoint on YAM. Each link goes to a dedicated page with full commands.

Pipeline stages

Reference

Patches reference — making pistar Stage 4 / 5 actually run

What’s adapted from where

Component	Origin	Adaptation for YAM
Algorithm	pi0.6 / RECAP paper	unchanged
Code base	ybpy/pistar (JAX fork of openpi)	added YAM `TrainConfig`s, Aloha-data adv_ind passthrough, 13 upstream-bug patches
Collection stack	limb (YAM control)	added DAgger session lifecycle, 6 pistar-shaped converter helpers, `--pistar` / `--pistar-demo` flags
Policy serving	openpi `serve_policy.py`	unchanged — pi0.6 checkpoints serve through it natively (no CFG-sampler shim needed)
VLM value model checkpoint	ybpy/vlm_ckpt (HF) / Google Drive	unchanged

Why pistar, not RLinf

Both pistar and RLinf implement pi0.6 / RECAP and use the same value model (SigLIP-So400m + Gemma3-270M + 201-bin C51 head over [-1, 0]). They differ in backend, labeling pipeline, validation regime, and serving-time conditioning:

Dimension	RLinf	pistar
Backend	PyTorch	JAX
Relation to openpi	vendors openpi	fork of openpi
Validation	LIBERO simulation only	real robot (SO-101, AgileX PiPER)
Advantage labeling	quantile from value model with no auxiliary labels	VLM-based `value_label` / `reward_label` (per-frame supervision)
Conditioning at serving	CFG sampler required	`adv_ind` is just a tokenizer input — vanilla `serve_policy.py` works
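
To make the value head concrete, here is a minimal pure-Python sketch of how a 201-bin categorical (C51-style) head over [-1, 0] is commonly turned into a scalar value. It assumes uniformly spaced bin centers that include both endpoints and a softmax over the logits; neither pistar's nor RLinf's actual implementation is reproduced here, and the function names are hypothetical.

```python
import math

NUM_BINS = 201           # from the value model description
V_MIN, V_MAX = -1.0, 0.0  # value range covered by the bins


def bin_support(num_bins=NUM_BINS, v_min=V_MIN, v_max=V_MAX):
    """Uniformly spaced bin centers, endpoints included (assumed layout)."""
    step = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * step for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits):
    """Scalar value estimate: probability-weighted mean of the bin centers."""
    probs = softmax(logits)
    support = bin_support(len(logits))
    return sum(p * z for p, z in zip(probs, support))
```

With 201 bins over a range of width 1, adjacent bin centers are 0.005 apart under this layout, and a uniform distribution over the bins gives the midpoint of the range.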

For real-robot YAM the pistar path is the closer fit. However, Stages 4 + 5 are broken on upstream main; we made them work with 15 targeted patches, documented on the patches page.