# RECAP on YAM

End-to-end documentation for our **RECAP** (RL with Experience and Corrections via Advantage-conditioned Policies) implementation on YAM bimanual arms. RECAP is the offline RL algorithm in **pi0.6** ([π★₀.₆: a VLA That Learns From Experience](https://arxiv.org/abs/2511.14759), Physical Intelligence et al.). This site documents the *full pipeline we actually run* on real hardware:

- **Data collection** in [`limb`](https://github.com/TToTMooN/limb): DAgger sessions with a three-state phase machine (AUTONOMOUS / PAUSED / CORRECTING) and operator-driven episode lifecycle.
- **Data conversion** via `limb convert-lerobot --pistar`: produces a LeRobot v3.0 dataset with the five RECAP columns (`intervention`, `reward`, `reward_label`, `value_label`, `adv_ind`).
- **Training** in [pistar](https://github.com/ybpy/pistar) (JAX): six stages from SFT through full RECAP, with **15 patches** we wrote on top of upstream pistar to make Stages 4 + 5 (VLM value model + VLM advantage labeling) actually runnable.
- **Evaluation** through `openpi/scripts/serve_policy.py` and limb's `OpenPIClient`: the trained pi0.6 checkpoint serves through the standard openpi wire protocol with no limb-side changes.

For the algorithm itself read the [pi★0.6 paper](https://arxiv.org/abs/2511.14759). For the reference RECAP pipeline structure see the [RLinf RECAP page](https://rlinf.readthedocs.io/en/latest/rst_source/examples/embodied/recap.html); RLinf is sim-only (LIBERO), while this site documents a real-robot implementation.

## Quick-start path

The shortest path from a fresh checkout to a working pi0.6 checkpoint on YAM. Each link goes to a dedicated page with full commands.

```{toctree}
:maxdepth: 1
:caption: Pipeline stages

overview
setup
stage0_collection
stage1_conversion
stage2_sft
stage3_lora
stage4_value
stage5_advantage
stage6_recap
evaluation
```

```{toctree}
:maxdepth: 1
:caption: Reference

patches
```

## What's adapted from where

| Component | Origin | Adaptation for YAM |
|-----------|--------|--------------------|
| Algorithm | pi0.6 / RECAP paper | unchanged |
| Code base | [ybpy/pistar](https://github.com/ybpy/pistar) (JAX fork of openpi) | added YAM `TrainConfig`s, Aloha-data adv_ind passthrough, 13 upstream-bug patches |
| Collection stack | [limb](https://github.com/TToTMooN/limb) (YAM control) | added DAgger session lifecycle, 6 pistar-shaped converter helpers, `--pistar` / `--pistar-demo` flags |
| Policy serving | [openpi `serve_policy.py`](https://github.com/Physical-Intelligence/openpi) | unchanged; pi0.6 checkpoints serve through it natively (no CFG-sampler shim needed) |
| VLM value model checkpoint | [ybpy/vlm_ckpt](https://huggingface.co/ybpy/vlm_ckpt) (HF) / Google Drive | unchanged |

## Why pistar, not RLinf

Both pistar and RLinf implement pi0.6 / RECAP and use the same value model (SigLIP-So400m + Gemma3-270M + 201-bin C51 head over `[-1, 0]`). The difference is the labeling pipeline and the validation regime:

| Dimension | RLinf | pistar |
|-----------|-------|--------|
| Backend | PyTorch | **JAX** |
| Relation to openpi | vendors openpi | **fork of openpi** |
| Validation | LIBERO simulation only | **real robot** (SO-101, AgileX PiPER) |
| Advantage labeling | quantile from value model with no auxiliary labels | **VLM-based** `value_label` / `reward_label` (per-frame supervision) |
| Conditioning at serving | CFG sampler required | **`adv_ind` is just a tokenizer input**; vanilla `serve_policy.py` works |

For real-robot YAM the pistar path is a closer fit.
Upstream pistar `main` is, however, broken for Stages 4 + 5; we made them work with 15 targeted patches documented on [the patches page](patches.md).
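
To make the "201-bin C51 head over `[-1, 0]`" shared by both value models more concrete, here is a minimal pure-Python sketch of how a categorical value head of that shape can be read out and supervised. It assumes evenly spaced bin centers with both endpoints included (so adjacent bins are 0.005 apart) and a two-hot projection for scalar targets; the function names are hypothetical, and the actual bin layout and target projection in pistar or RLinf may differ.

```python
import math

NUM_BINS = 201            # from the value-model description above
V_MIN, V_MAX = -1.0, 0.0  # value range covered by the head


def bin_centers(num_bins=NUM_BINS, v_min=V_MIN, v_max=V_MAX):
    """Evenly spaced support, endpoints included (an assumption)."""
    step = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * step for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distribution_mean(probs, support):
    """Scalar value implied by a categorical distribution over the bins."""
    return sum(p * z for p, z in zip(probs, support))


def expected_value(logits, support):
    """Read a scalar value out of raw head logits."""
    return distribution_mean(softmax(logits), support)


def two_hot(target, support):
    """Project a scalar target onto its two neighbouring bins.

    The weights are chosen so the mean of the resulting distribution
    equals the (clamped) target.
    """
    v_min, v_max = support[0], support[-1]
    last = len(support) - 1
    target = min(max(target, v_min), v_max)  # clamp into the support
    pos = (target - v_min) / ((v_max - v_min) / last)
    lo = min(int(math.floor(pos)), last)
    hi = min(lo + 1, last)
    w_hi = pos - lo
    probs = [0.0] * len(support)
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs


if __name__ == "__main__":
    support = bin_centers()
    target_dist = two_hot(-0.5025, support)
    print(distribution_mean(target_dist, support))
```

Under these assumptions, the head would be trained with a cross-entropy loss between its softmax output and the two-hot target, and `expected_value` would turn its logits back into a scalar in `[-1, 0]`.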