# Stage 5 — Advantage labeling (VLM relabel of `adv_ind`) Use the value model from [Stage 4](stage4_value.md) to compute an N-step advantage per autonomous frame, percentile-binarize, and write the result back into the dataset's `adv_ind` column **in place**. ```{warning} This step **modifies the v2.1 dataset on disk** — `adv_ind` values are overwritten for every autonomous frame. **Always run it against a copy**, not the original from [Stage 1](stage1_conversion.md). That way [Stage 3](stage3_lora.md) (pre-VLM LoRA-from-SFT) and [Stage 6](stage6_recap.md) (post-VLM full RECAP) can both re-use their respective dataset variants for comparison and re-runs. ``` ## Make a standalone copy of the dataset The v2.1 layout symlinks its data parquets back to the v3.0 originals (see [Stage 1](stage1_conversion.md) § *Why two converters*). A naive `cp -r` would preserve those symlinks and Stage 5 would write *through* them, corrupting the v3.0 source. Use **`cp -rL`** to materialize the parquets into a standalone tree, then register a fresh lerobot-cache symlink for the copy: ```bash cd ~/limb/datasets # 1. Materialize a copy (follows symlinks → standalone files, ~54 MB) cp -rL vial_rollout_v1_v21 vial_rollout_v1_v21_vlm_label # 2. Register the copy in pistar's lerobot cache so it resolves # `repo_id="local/vial_rollout_v1_v21_vlm_label"` to this path ln -sfn ~/limb/datasets/vial_rollout_v1_v21_vlm_label \ ~/.cache/huggingface/lerobot/local/vial_rollout_v1_v21_vlm_label # 3. Confirm the copy is standalone (and that the original is reachable) python3 -c " import pyarrow.parquet as pq, glob, collections for label, root in [('original', '~/limb/datasets/vial_rollout_v1_v21'), ('copy', '~/limb/datasets/vial_rollout_v1_v21_vlm_label')]: f = sorted(glob.glob(f'{root}/data/**/*.parquet', recursive=True))[0] print(label, '→', dict(collections.Counter(pq.read_table(f, columns=['adv_ind'])['adv_ind'].to_pylist()))) " # Before Stage 5: both labels print the same {none: …, positive: …} distribution. # After Stage 5: original is unchanged; copy has {positive: …, negative: …, none == 0}. ``` ```{tip} Subsequent RECAP iterations follow the same pattern: each round, make a *new* copy (e.g. `..._vlm_label_v2`) before running Stage 5 so you can compare iteration N against N-1 and roll back if a relabel goes bad. ``` ```{note} `label_advantage_from_vlm.py` is a separate script that ships its own copy of the data-config block and the `GemmaValueTokenizer` class, so the API-drift fixes from Stage 4 don't reach it. You must apply **patches 14 and 15** from [the patches reference](patches.md) to this script too before running. ``` ## What it does Per pistar's `scripts/label_advantage_from_vlm.py` docstring (verbatim): > 1) Classify each episode by `intervention`: all-1 episodes are demos > and are skipped; episodes with any 0 are rollouts and are fully > relabeled. > 2) Run VLM value inference for rollout rows and the lookahead endpoint > rows needed to compute their N-step advantage. > 3) Convert 201-dim logits → softmax → expectation over supports in > `[-1.0, 0.0]`. > 4) Compute N-step Advantage per rollout time step: > `A_t = sum_{k=0}^{N-1} r_{t+k} + V_{t+N} - V_t`. > 5) Compute the percentile threshold over rollout advantages of > non-intervention steps. > 6) For rollout rows only: > - if `intervention = 1`, set `adv_ind = positive` > - if `intervention = 0`, mark the configured top percentage as > `positive`, otherwise `negative`. > > Existing labels on rollout rows are overwritten; demo rows are > preserved. For your dataset that means: intervention frames stay `positive`; autonomous frames (previously all `none`) are now either `positive` or `negative` based on whether the VLM thinks they were high-value transitions. ## Command ```bash cd ~/limb/pistar source ~/.venvs/pistar/bin/activate python scripts/label_advantage_from_vlm.py \ --data_dir ~/limb/datasets/vial_rollout_v1_v21_vlm_label \ --checkpoint_dir checkpoints/value_model/yam_vial_v1/step_00005000 \ --tokenizer_path ~/vlm_ckpt/tokenizer.model \ --batch_size 8 \ --lookahead 50 \ --human_col intervention \ --adv_col adv_ind \ --base_image_col observation.images.head_camera \ --wrist_image_col observation.images.left_wrist_camera \ --right_wrist_image_col observation.images.right_wrist_camera \ --use_ema ``` ```{important} `--data_dir` points at the **copy** (`..._vlm_label`), not the [Stage 1](stage1_conversion.md) original. The original stays unchanged so Stage 3 reproductions and rollback comparisons still work. `--checkpoint_dir` accepts a specific step (recommended) so the script loads the version you trained for, not the latest auto-pick. Runs on ~21k frames take ~10–12 min at batch 8 on a single 24 GB GPU; multi-GPU will scale. ``` ### Flag explanations | Flag | Notes | |---------------------------------|------------------------------------------------------------------------------------------------------------------------------------| | `--data_dir` | Same v2.1 dataset Stage 4 trained on. Overwritten in place — back up first. | | `--checkpoint_dir` | Stage 4 output dir. Picks the latest `step_*` automatically; override with `--checkpoint_name step_XXXXX` for a specific one. | | `--use_ema` | Use the EMA-smoothed params (`ema_params` subtree). Pistar default; generally less noisy than the live `params` copy. | | `--lookahead 50` | N-step horizon for `A_t`. Pistar default. Drop to 10–20 for short episodes. | | `--human_col intervention` | Our column name (limb's `--pistar` convention). Pistar's default. | | `--adv_col adv_ind` | Our column name. Pistar's default. | | `--base_image_col` | Pass with dots — pistar's `_column_candidates` uses dotted names verbatim (no `observation/` prefix expansion). | | `--wrist_image_col` | Same convention. | | `--right_wrist_image_col` | Our second wrist; pistar's value model only consumes one wrist, but `label_advantage_from_vlm.py` exposes both for the value calc. | ### Tuning - `--positive_ratio 0.3` (default in pistar): top 30% of autonomous-frame advantages become `positive`. Bump to 0.2 for a stricter positive set. - `--batch_size`: increase for faster inference if your GPU has the memory. On a 24 GB consumer GPU 8 is comfortable; on H100 set 32–64. ## Verify the relabel After the run, every frame's `adv_ind` should be in `{positive, negative, none}` — and on rollout-only datasets there should be **zero `none`s** (every autonomous frame got classified). ```bash uv run python <<'PY' import glob, pyarrow.parquet as pq, collections DATA = "~/limb/datasets/vial_rollout_v1_v21_vlm_label" counts = collections.Counter() intervention_pos, intervention_neg = 0, 0 auto_pos, auto_neg, auto_none = 0, 0, 0 for f in sorted(glob.glob(f"{DATA}/data/**/*.parquet", recursive=True)): t = pq.read_table(f).to_pandas() counts.update(t["adv_ind"]) iv = t["intervention"].astype(bool).values av = t["adv_ind"].values intervention_pos += int(((iv) & (av == "positive")).sum()) intervention_neg += int(((iv) & (av == "negative")).sum()) auto_pos += int(((~iv) & (av == "positive")).sum()) auto_neg += int(((~iv) & (av == "negative")).sum()) auto_none += int(((~iv) & (av == "none")).sum()) print("adv_ind global:", dict(counts)) print(f"intervention=1 frames: {intervention_pos} positive {intervention_neg} negative") print(f"intervention=0 frames: {auto_pos} positive {auto_neg} negative {auto_none} none") PY ``` **Actual output on the reference 10-episode dataset after a clean Stage 5 run** (`--lookahead 50 --use_ema`, default `--positive_ratio 0.3`): ```text adv_ind global: {'negative': 10022, 'positive': 11264} intervention=1 frames: 6968 positive 0 negative intervention=0 frames: 4296 positive 10022 negative 0 none ``` - `intervention=1 ... 0 negative` — intervention frames are never negative (the script preserves them as positive). - `intervention=0 ... 0 none` — every autonomous frame got a verdict. - `4296 / (4296 + 10022) = 30.0%` of autonomous frames are positive — matches `--positive_ratio 0.3` to the percent. If your run produces non-zero `none` counts under `intervention=0`, the script crashed mid-run; re-run (the relabel is idempotent). If `none` count is non-zero after Stage 5, the script crashed mid-run. Re-run; the relabel is idempotent. ## Operating principle (intuition) The value model has learned `V(o) ≈ how-close-to-goal-is-this-state`. N-step advantage approximates `did the policy actually improve over the next N steps?` — positive advantage means the autonomous trajectory was making progress, negative advantage means it was making things worse. Stage 6 then learns from this: - Conditioned `positive` → the policy reproduces frames that the value model considered progress (correction frames + good autonomous runs). - Conditioned `negative` → the policy *avoids* the action distribution the bad autonomous frames came from. At inference you always condition `positive`. ## Next Run the final pi0.6 fine-tune → [Stage 6](stage6_recap.md).