RECAP on YAM

End-to-end documentation for our RECAP (RL with Experience and Corrections via Advantage-conditioned Policies) implementation on YAM bimanual arms. RECAP is the offline RL algorithm in pi0.6 (π★₀.₆: a VLA That Learns From Experience, Physical Intelligence et al.).

This site documents the full pipeline we actually run on real hardware:

  • Data collection in limb — DAgger sessions with a three-state phase machine (AUTONOMOUS / PAUSED / CORRECTING) and operator-driven episode lifecycle.

  • Data conversion via limb convert-lerobot --pistar — produces a LeRobot v3.0 dataset with the five RECAP columns (intervention, reward, reward_label, value_label, adv_ind).

  • Training in pistar (JAX) — six stages from SFT through full RECAP, with 15 patches we wrote on top of upstream pistar to make Stages 4 + 5 (VLM value model + VLM advantage labeling) actually runnable.

  • Evaluation through openpi/scripts/serve_policy.py and limb’s OpenPIClient — the trained pi0.6 checkpoint serves through the standard openpi wire protocol with no limb-side changes.
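The three-state phase machine named in the data-collection step can be sketched as follows. This is a minimal illustration, not limb's actual code: the phase names come from the text above, but the event names (`pause`, `resume`, `start_correction`), the allowed transitions, and the rule that `intervention` is set only while CORRECTING are all assumptions made for the example.

```python
from enum import Enum, auto


class Phase(Enum):
    AUTONOMOUS = auto()   # policy drives the arms
    PAUSED = auto()       # rollout halted, nobody is driving
    CORRECTING = auto()   # operator is teleoperating a correction


# Hypothetical transition table: event names and allowed moves are
# assumptions for illustration, not limb's real session lifecycle.
TRANSITIONS = {
    (Phase.AUTONOMOUS, "pause"): Phase.PAUSED,
    (Phase.PAUSED, "resume"): Phase.AUTONOMOUS,
    (Phase.PAUSED, "start_correction"): Phase.CORRECTING,
    (Phase.CORRECTING, "pause"): Phase.PAUSED,
    (Phase.CORRECTING, "resume"): Phase.AUTONOMOUS,
}


def step(phase: Phase, event: str) -> Phase:
    """Return the next phase, rejecting transitions not in the table."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {phase.name} on {event!r}") from None


def intervention_flag(phase: Phase) -> bool:
    """Assumed rule: a frame counts as an intervention only while CORRECTING."""
    return phase is Phase.CORRECTING


if __name__ == "__main__":
    phase = Phase.AUTONOMOUS
    for event in ["pause", "start_correction", "resume"]:
        phase = step(phase, event)
        print(event, "->", phase.name, "| intervention:", intervention_flag(phase))
```

Keeping the transitions in one explicit table means an unexpected operator input fails loudly instead of silently mislabeling frames, which matters when a per-frame flag such as `intervention` is later used as a training signal.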

For the algorithm itself, read the pi★0.6 paper. For the reference RECAP pipeline structure, see the RLinf RECAP page. RLinf is validated in simulation only (LIBERO), while this site documents a real-robot implementation.

Quick-start path

The shortest path from a fresh checkout to a working pi0.6 checkpoint on YAM. Each link goes to a dedicated page with full commands.

What’s adapted from where

| Component | Origin | Adaptation for YAM |
|---|---|---|
| Algorithm | pi0.6 / RECAP paper | unchanged |
| Code base | ybpy/pistar (JAX fork of openpi) | added YAM TrainConfigs, Aloha-data adv_ind passthrough, 13 upstream-bug patches |
| Collection stack | limb (YAM control) | added DAgger session lifecycle, 6 pistar-shaped converter helpers, --pistar / --pistar-demo flags |
| Policy serving | openpi serve_policy.py | unchanged; pi0.6 checkpoints serve through it natively (no CFG-sampler shim needed) |
| VLM value model checkpoint | ybpy/vlm_ckpt (HF) / Google Drive | unchanged |

Why pistar, not RLinf

Both pistar and RLinf implement pi0.6 / RECAP and use the same value model (SigLIP-So400m + Gemma3-270M + 201-bin C51 head over [-1, 0]). The main differences are the labeling pipeline and the validation regime:

| Dimension | RLinf | pistar |
|---|---|---|
| Backend | PyTorch | JAX |
| Relation to openpi | vendors openpi | fork of openpi |
| Validation | LIBERO simulation only | real robot (SO-101, AgileX PiPER) |
| Advantage labeling | quantile from value model with no auxiliary labels | VLM-based value_label / reward_label (per-frame supervision) |
| Conditioning at serving | CFG sampler required | adv_ind is just a tokenizer input, so vanilla serve_policy.py works |

For real-robot YAM the pistar path is a closer fit. However, Stages 4 + 5 are broken on upstream pistar's main branch; we made them work with 15 targeted patches, documented on the patches page.
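The value model shared by both implementations ends in a 201-bin C51 head over [-1, 0]. As a rough sketch of how a categorical head of that shape is usually decoded into a scalar value, the example below assumes the standard C51 layout: 201 evenly spaced atoms with the endpoints included (spacing 0.005), and a value equal to the softmax-weighted mean of the atoms. The function names are hypothetical and are not pistar's or RLinf's API; the real implementations may place bins differently.

```python
import math

NUM_BINS = 201            # from the text: 201-bin C51 head
V_MIN, V_MAX = -1.0, 0.0  # from the text: support is [-1, 0]


def bin_centers(num_bins=NUM_BINS, v_min=V_MIN, v_max=V_MAX):
    """Evenly spaced atoms including both endpoints (assumed layout)."""
    delta = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * delta for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits):
    """Decode categorical logits into a scalar: sum_i p_i * z_i."""
    if len(logits) != NUM_BINS:
        raise ValueError(f"expected {NUM_BINS} logits, got {len(logits)}")
    probs = softmax(logits)
    return sum(p * z for p, z in zip(probs, bin_centers()))


if __name__ == "__main__":
    uniform = [0.0] * NUM_BINS                    # no preference between bins
    confident = [-1000.0] * (NUM_BINS - 1) + [0.0]  # all mass on the last atom
    print(expected_value(uniform))
    print(expected_value(confident))
```

With uniform logits the decoded value sits at the midpoint of the support, and with all mass on the last atom it sits at the upper bound, so every decoded value stays inside [-1, 0] by construction.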