# Overview

## What RECAP is

**RECAP** (*RL with Experience and Corrections via Advantage-conditioned Policies*) is the offline RL algorithm that pi0.6 uses to self-improve from heterogeneous data: SFT demonstrations, autonomous rollouts, and operator-driven corrections.

The mechanism in one sentence: train a VLM-based value model on the collected data, use it to classify each *autonomous* frame as high-advantage (`positive`) or low-advantage (`negative`), then continue fine-tuning the policy with the per-frame advantage class fed in as a **tokenized conditioning signal** (`adv_ind`). At inference, condition on `positive`.

Key properties:

- **Offline**: no online environment interaction is needed during value model training.
- **Heterogeneous data friendly**: successful demos, failed rollouts, and operator corrections all contribute meaningfully.
- **Same VLA architecture**: the policy is the standard pi0.5/pi0.6 with a single extra tokenizer input (`adv_ind`). No new parameters at the architecture level; no CFG-style sampler at inference.

## The six stages

| Stage | What it does | Tool | Output |
|-------|--------------|------|--------|
| 0 | Collect DAgger rollouts (operator pedals + keyboard for episode lifecycle) | `limb record …` | raw episodes under `recordings//` |
| 1 | Convert to LeRobot v3.0 with five RECAP columns (+v3→v2.1) | `limb convert-lerobot --pistar` + `openpi convert_v3_to_v21.py` | `datasets/_pistar_v1_v21/` |
| 2 | Initial pi0.5 SFT on demos | `openpi/scripts/train.py` | SFT checkpoint (e.g. `ttotmoon/yam-vial-place-pi05-v1`) |
| 3 | pi0.6 **full** fine-tune from SFT, no VLM yet (limb-supplied `adv_ind`) | `pistar/scripts/train.py` | pi0.6 checkpoint |
| 4 | Train the VLM value model on `value_label` | `pistar/scripts/train_value.py` (+ our 13 patches) | value model checkpoint |
| 5 | Run the value model to relabel `adv_ind` on autonomous frames | `pistar/scripts/label_advantage_from_vlm.py` | dataset with VLM-classified `adv_ind` in place |
| 6 | Continue pi0.6 **full** fine-tune on the relabeled dataset (full RECAP) | `pistar/scripts/train.py` | full-RECAP pi0.6 checkpoint |

Each stage has its own page on this site, with the exact command and expected output.

## How our pipeline differs from RLinf and Evo-RL

Three public RECAP implementations exist; per the table below, two of them are validated on real robots. We use **pistar**.

```{list-table}
:header-rows: 1
:widths: 25 25 25 25

* - Aspect
  - pistar (this site)
  - RLinf
  - Evo-RL
* - Validation
  - real robot (SO-101, PiPER), and now YAM via this work
  - LIBERO sim only
  - real robot (SO-101, AgileX PiPER)
* - Backend
  - JAX (flax.nnx)
  - PyTorch
  - PyTorch (LeRobot 0.4.4)
* - Repo relation to openpi
  - fork of openpi
  - vendors openpi
  - LeRobot-native, no openpi
* - Conditioning at serving
  - tokenized `adv_ind` via openpi's standard tokenizer (vanilla `serve_policy.py`)
  - CFG sampler with `cfgrl_guidance_scale` knob; needs a shim around serve
  - `Advantage: positive`/`negative` text appended to the task prompt
* - Value labeling
  - VLM-based supervision on per-frame `value_label` + `reward_label`; advantage = N-step value-target rollout
  - Critic-Expert head, advantage = N-step lookahead on values, top-quantile binarization
  - same as pistar in spirit; different field names
```

## Architecture at a glance

```text
                 ┌──────────────────────────────────────────┐
                 │  YAM bimanual + 3 cameras (RealSense)    │
                 │  + iKKEGOL foot pedal (phase trigger)    │
                 └──────────────────┬───────────────────────┘
                                    │ control loop @ 30 Hz
                 ┌──────────────────▼───────────────────────┐
                 │  limb (Python)                           │
                 │    ├─ DAggerAgent (phase machine)        │
                 │    ├─ OpenPIClient → 0.0.0.0:8111        │
                 │    └─ DAggerCollectionSession (s/SPACE)  │
                 └──────────────────┬───────────────────────┘
                                    │ recordings//
                                    │
                 ┌──────────────────▼───────────────────────┐
                 │  limb convert-lerobot --pistar           │
                 │    ├─ adds 5 RECAP columns (Stage 1)     │
                 │    └─ + openpi convert_v3_to_v21.py      │
                 └──────────────────┬───────────────────────┘
                                    │ datasets/.../v21/
                                    │
   ┌────────────────────────────────┼────────────────────────────────┐
   │                                │                                │
┌──▼──────────┐         ┌───────────▼──────────┐         ┌───────────▼──────────┐
│ openpi      │         │ pistar               │         │ pistar               │
│ Stage 2 SFT │         │ Stage 3 LoRA-from-SFT│         │ Stage 4 train_value  │
│ (JAX)       │         │ (JAX, pistar fork)   │         │ (JAX, our 13 patches)│
└──┬──────────┘         └───────────┬──────────┘         └───────────┬──────────┘
   │                                │                                │
   │ ckpt to HF                     │                                │ value-model ckpt
   │                                │                                │
   │           ┌────────────────────▼────────────────┐               │
   │           │ pistar label_advantage_from_vlm.py  │               │
   │           │ (rewrites adv_ind in place)         │               │
   │           └────────────────────┬────────────────┘               │
   │                                │                                │
   │           ┌────────────────────▼────────────────┐               │
   │           │ pistar Stage 6 train.py             │               │
   │           │ (full RECAP fine-tune)              │               │
   │           └────────────────────┬────────────────┘               │
   │                                │                                │
   ▼                                ▼                                ▼
                      openpi serve_policy.py :8111
                                    │
                                    │ websocket
                                    ▼
                   limb teleop / limb record (DAgger)
                                    │
                                    └───── back to top: more rollouts → relabel → fine-tune
```

## When to use which checkpoint

| Goal | Run this |
|------|----------|
| Real RECAP improvement on >100 episodes | [Full Stages 1→6](stage6_recap.md), full fine-tune |
| Quick end-to-end smoke on a single 24 GB GPU | [Stage 3 LoRA-from-SFT](stage3_lora.md), no Stage 4-5 |
| Collect more rollouts for the next round | [Stage 0](stage0_collection.md) after serving any checkpoint |

```{note}
**Default path is full fine-tuning** (multi-GPU, e.g. 8× H100). The LoRA variants documented across the stage pages are kept for single-GPU development or quick smoke tests; they share the same data path and YAM TrainConfig structure and produce the same architecture, just with the backbone frozen.
```

A note on scale (from the [pi0.6 paper](https://arxiv.org/abs/2511.14759) Appendix A-F): for each task, the paper uses **287–450 correction episodes per iteration**, sometimes across multiple iterations. On ~10 episodes the VLM value model overfits and Stages 4-5 add little beyond Stage 3; on ~100 episodes they start to matter; at ~300+ the dataset matches the paper's regime.

(yam-trainconfig-reference)=
## YAM TrainConfig reference

The ten pi0.6 TrainConfigs we registered in [`pistar/src/openpi/training/config.py`](https://github.com/ybpy/pistar) form five `(train, infer)` pairs. The two configs in a pair differ only in the `adv_ind_dropout` flag (`True` for training, `False` for serving, so the positive tag is always present at inference). All ten share the same model architecture (`Pi0Config(pi05=True, pistar=True)`), the same 3-camera repack, the same `default_prompt` for the YAM vial-handover task, and `adapt_to_pi=False` (YAM joint conventions, not ALOHA's).

```{list-table}
:header-rows: 1
:widths: 28 10 14 22 12 14

* - TrainConfig name
  - Variant
  - Init weights
  - Dataset (`repo_id`)
  - Stage
  - Purpose
* - `pi06_yam_vial_30fps`
  - full
  - `pi05_base`
  - `local/vial_rollout_v1_v21`
  - 3 (full alt.)
  - From-scratch pi0.6 full fine-tune (starting from `pi05_base`, not the SFT) on the limb-supplied (`adv_ind ∈ {positive, none}`) dataset. The "ignore the SFT" baseline.
* - `pi06_yam_vial_30fps_infer`
  - full
  - same
  - same
  - 3 (serve)
  - Serving config for the above (`adv_ind_dropout=False`).
* - `pi06_yam_vial_30fps_lora`
  - LoRA
  - `pi05_base`
  - `local/vial_rollout_v1_v21`
  - 3 (LoRA alt.)
  - LoRA version of `pi06_yam_vial_30fps` for single-24-GB-GPU dev.
* - `pi06_yam_vial_30fps_lora_infer`
  - LoRA
  - same
  - same
  - 3 (serve)
  - Serving config for the LoRA-from-scratch variant.
* - `pi06_yam_vial_30fps_lora_from_sft`
  - LoRA
  - **SFT** (`yam-vial-place-pi05-v1`)
  - `local/vial_rollout_v1_v21`
  - **3** (recommended)
  - The Stage 3 default: LoRA fine-tune starting from the openpi SFT checkpoint, no VLM relabel.
* - `pi06_yam_vial_30fps_lora_from_sft_infer`
  - LoRA
  - same
  - same
  - 3 (serve)
  - Serving config for the Stage 3 LoRA-from-SFT checkpoint.
* - `pi06_yam_vial_30fps_lora_from_sft_recap`
  - LoRA
  - **SFT** (`yam-vial-place-pi05-v1`)
  - **`local/vial_rollout_v1_v21_vlm_label`**
  - **6** (recommended)
  - The Stage 6 default: LoRA fine-tune on the **VLM-relabeled** copy. Only `repo_id` differs from the Stage 3 LoRA-from-SFT config.
* - `pi06_yam_vial_30fps_lora_from_sft_recap_infer`
  - LoRA
  - same
  - same
  - 6 (serve)
  - Serving config for the Stage 6 RECAP LoRA checkpoint.
* - `pi06_yam_vial_30fps_from_sft_recap`
  - **full**
  - **SFT** (`yam-vial-place-pi05-v1`)
  - **`local/vial_rollout_v1_v21_vlm_label`**
  - **6** (8× H100)
  - The Stage 6 paper-style recipe: full fine-tune (no LoRA, no freeze) on the VLM-relabeled copy, init from the SFT. `batch_size=56`.
* - `pi06_yam_vial_30fps_from_sft_recap_infer`
  - full
  - same
  - same
  - 6 (serve)
  - Serving config for the Stage 6 full-fine-tune RECAP checkpoint.
```

### Picking one

| Situation | Config |
|-----------|--------|
| Single 24 GB GPU, want to reproduce Stage 3 | `pi06_yam_vial_30fps_lora_from_sft` |
| Single 24 GB GPU, want to reproduce Stage 6 (RECAP) | `pi06_yam_vial_30fps_lora_from_sft_recap` |
| 8× H100, paper-style RECAP | `pi06_yam_vial_30fps_from_sft_recap` |
| Skipping the SFT, starting from `pi05_base` | `pi06_yam_vial_30fps` (full) or `_lora` variant |
| Serving any of the above | The matching `_infer` config |
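
The selection table above can also be written as a small lookup. The following is a minimal sketch and not part of pistar: the function names `pick_train_config` and `serving_config` and their parameters are hypothetical, and only the returned config names are taken from the TrainConfig reference. Combinations the reference does not list (for example a full fine-tune from the SFT at Stage 3) raise an error rather than guess a name.

```python
from __future__ import annotations

# Common prefix of every YAM pi0.6 TrainConfig name in the reference table.
_BASE = "pi06_yam_vial_30fps"


def pick_train_config(*, stage: int, full_finetune: bool, from_sft: bool = True) -> str:
    """Return the training TrainConfig name for a Stage 3 or Stage 6 run.

    Hypothetical helper: it only encodes the rows of the reference table.
    """
    if stage not in (3, 6):
        raise ValueError("only Stage 3 and Stage 6 train a pi0.6 policy")

    if not from_sft:
        # Skipping the SFT means starting from pi05_base; the table lists
        # such configs for Stage 3 only.
        if stage != 3:
            raise ValueError("no pi05_base-initialized config is listed for Stage 6")
        return _BASE if full_finetune else f"{_BASE}_lora"

    if stage == 3:
        if full_finetune:
            raise ValueError("no full from-SFT Stage 3 config is listed")
        return f"{_BASE}_lora_from_sft"

    # Stage 6: fine-tune on the VLM-relabeled dataset, initialized from the SFT.
    return f"{_BASE}_from_sft_recap" if full_finetune else f"{_BASE}_lora_from_sft_recap"


def serving_config(train_config: str) -> str:
    """Map a training config name to its matching `_infer` serving config."""
    if train_config.endswith("_infer"):
        return train_config
    return f"{train_config}_infer"


if __name__ == "__main__":
    name = pick_train_config(stage=6, full_finetune=False)
    print(name, "->", serving_config(name))
```

The returned string is the config name to hand to the training or serving script; how it is passed depends on that script's own command-line interface, which this sketch does not assume.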