Stage 4 — VLM value model training (pistar, patched)

Train the value model (SigLIP-So400m + Gemma3-270M with a 201-bin C51 critic head) on per-frame value_label supervision. Output: a value model that predicts V(o_t) from (image, wrist_image, state, prompt).

Warning

Pistar’s scripts/train_value.py is broken on upstream main: it imports a ValueModelWeightLoader that doesn’t exist, depends on a gemma/gm/data/ directory that isn’t shipped, and references modules that have since been renamed in current kauldron / etils, among other problems. Patches 1-13, documented in full in the patches reference, resolve all of this, and this page assumes they are applied. (Stage 5 needs two more patches: 14 and 15.)

Required inputs

  1. The v2.1 dataset with the five RECAP columns and the lerobot-cache symlink (local/<dataset> resolution).

  2. The VLM checkpoint bundle at ~/vlm_ckpt/ (or $OPENPI_VLM_CKPT_DIR), the location the commands and logs on this page use.

  3. The 13 pistar patches from patches.md applied.
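The path-based inputs above can be checked mechanically before launching anything. The following is a minimal sketch, not part of pistar: `check_inputs` is a hypothetical helper, and the default paths are simply the ones used in the commands on this page. It only confirms that the paths exist; it cannot tell whether the RECAP columns are present or the patches are applied.

```python
import os
from pathlib import Path


def check_inputs(dataset_dir, ckpt_dir, tokenizer_path):
    """Return a list of human-readable problems; an empty list means all paths exist.

    Hypothetical helper for illustration, not part of pistar.
    """
    problems = []
    if not Path(dataset_dir).expanduser().is_dir():
        problems.append(f"dataset directory missing: {dataset_dir}")
    if not Path(ckpt_dir).expanduser().is_dir():
        problems.append(f"VLM checkpoint bundle missing: {ckpt_dir}")
    if not Path(tokenizer_path).expanduser().is_file():
        problems.append(f"tokenizer file missing: {tokenizer_path}")
    return problems


if __name__ == "__main__":
    # Defaults mirror the paths used in the commands on this page (assumption).
    bundle = os.environ.get("OPENPI_VLM_CKPT_DIR", "~/vlm_ckpt")
    report = check_inputs(
        "~/limb/datasets/vial_rollout_v1_v21",
        bundle,
        os.path.join(bundle, "tokenizer.model"),
    )
    for line in report or ["all inputs found"]:
        print(line)
```

An empty report only says the files are where the commands expect them; the smoke test below is still the real check.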

Quick smoke test (5 steps, ~30 s)

Confirm the patched pipeline runs end-to-end before committing to a long training run.

cd ~/limb/pistar
source ~/.venvs/pistar/bin/activate

XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
  python scripts/train_value.py \
    --data_dir ~/limb/datasets/vial_rollout_v1_v21 \
    --checkpoint_dir checkpoints/value_model/yam_vial_v1 \
    --batch_size 4 --num_train_steps 5 \
    --save_interval 100 --val_interval 0 \
    --load_pretrained \
    --tokenizer_path ~/vlm_ckpt/tokenizer.model \
    --wandb_mode disabled

A successful 5-step run prints (Chinese log strings are pistar upstream; English commentary added):

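ℹ 使用本地 Gemma3 tokenizer: ~/vlm_ckpt/tokenizer.model   ← using the local Gemma3 tokenizer
local_batch_size: 4
ℹ 数据集大小: 21286 帧                                  ← dataset size: 21286 frames (dataset reachable)
ℹ 加载 SigLIP + Gemma3-270M 预训练权重...                 ← loading SigLIP + Gemma3-270M pretrained weights
Restoring checkpoint from ~/vlm_ckpt/gemma-3-270m/step_00020000.
Finished restoring checkpoint in 1.33 seconds.
ValueModelWeightLoader: restored 241 leaf arrays from .../step_00020000 (key=params, step=20000)
✓ 预训练权重加载完成                                      ← pretrained weights loaded
模型初始化完成                                           ← model initialized
预取前几个batch以优化GPU利用率...                          ← prefetching the first few batches to improve GPU utilization
✓ 成功预取 3 个batch                                    ← prefetched 3 batches
JIT编译预热...                                          ← JIT compile warm-up
JIT编译完成,开始训练...                                  ← JIT done, training starts
Progress on:训练进度 1.00it/5.00it rate:12.1s/it           ← first step (compile); 训练进度 = training progress
Progress on:训练进度 5.00it/5.00it rate:1.9s/it elapsed:00:13   ← steady-state ~0.2 s/step
✓ 保存 checkpoint: .../checkpoints/value_model/yam_vial_v1/step_00000005   ← checkpoint saved
训练完成!                                               ← training complete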

The 5-step checkpoint is 5.1 GB on disk (full value model — SigLIP + Gemma3 + heads + EMA + step).

Real training run

On the reference dataset (10 episodes, 21,286 frames), ~5,000 steps is a sensible scale. The bundle is already at step 20,000 from a prior LIBERO run, so this is fine-tuning, not pretraining.

XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
  python scripts/train_value.py \
    --data_dir ~/limb/datasets/vial_rollout_v1_v21 \
    --checkpoint_dir checkpoints/value_model/yam_vial_v1 \
    --batch_size 4 \
    --num_train_steps 5000 \
    --save_interval 1000 \
    --val_interval 0 \
    --load_pretrained \
    --tokenizer_path ~/vlm_ckpt/tokenizer.model \
    --wandb_mode disabled

At ~0.2 s/step, that is ~17 minutes of wall-clock time. Checkpoints land at steps 1k, 2k, 3k, 4k, and 5k under checkpoints/value_model/yam_vial_v1/step_*.
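The arithmetic behind that estimate is easy to redo for other step counts. The sketch below is illustrative only: `estimate_run` is a hypothetical helper, the 0.2 s/step figure is the steady-state rate from the smoke test, and the zero-padded directory names follow the pattern seen in the logs (step_00000005). It assumes a checkpoint is also written at the final step, as the 5-step smoke run suggests.

```python
def estimate_run(num_train_steps, seconds_per_step, save_interval):
    """Return (wall-clock minutes, expected checkpoint directory names)."""
    minutes = num_train_steps * seconds_per_step / 60.0
    # One checkpoint every save_interval steps...
    steps = list(range(save_interval, num_train_steps + 1, save_interval))
    # ...plus one at the final step if the interval did not land on it (assumption).
    if not steps or steps[-1] != num_train_steps:
        steps.append(num_train_steps)
    return minutes, [f"step_{s:08d}" for s in steps]


minutes, names = estimate_run(5000, 0.2, 1000)
print(f"{minutes:.1f} min")  # 16.7 min
print(names)
```

The estimate ignores JIT compilation (about 12 s in the smoke test) and checkpoint write time, so treat it as a lower bound.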

Multi-GPU full-throttle on 8× H100:

accelerate launch --multi_gpu --num_processes=8 --mixed_precision=bf16 \
  $(which python) scripts/train_value.py \
    --data_dir <…> --checkpoint_dir <…> \
    --batch_size 64 --num_train_steps 30000 \
    --load_pretrained --tokenizer_path <…>/tokenizer.model

That matches the paper-scale configuration pistar documents (30k steps, batch 64).

Tuning knobs

| Flag | Default | Notes |
| --- | --- | --- |
| `--batch_size` | 32 | Drop to 4–8 on a single 24 GB consumer GPU; raise to 64+ on H100s. |
| `--num_train_steps` | 30000 | The bundle is already at step 20k; 5k more is plenty for small-task fine-tuning. |
| `--peak_lr` | 2.5e-5 | Drop to 1e-5 if loss diverges; pistar default schedule is cosine. |
| `--load_pretrained` | off | Required. Invokes our ValueModelWeightLoader against the VLM bundle. |
| `--tokenizer_path` | (auto) | Explicit path defeats pistar’s hardcoded /data/... fallback search. |
| `--freeze_mode` | all_backbones | Default freezes SigLIP + LLM. siglip_only and none are slower but lower-bias. |
| `--wandb_mode` | online | Set to disabled for the first dry runs. |
| `--val_data_dir` + `--val_interval` | off | Provide a held-out v2.1 dataset for periodic val loss; very useful at paper scale. |
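When sweeping these knobs it can help to assemble the command from a dictionary instead of editing a long shell line. This is a rough sketch under stated assumptions: `build_command` is a hypothetical helper, the flag names are the ones in the table above, and nothing here validates values against the script's own argument parser.

```python
import os
import shlex


def build_command(flags):
    """Turn a {flag: value} dict into an argv list for scripts/train_value.py.

    True emits a bare switch; None or False omits the flag (default applies).
    """
    argv = ["python", "scripts/train_value.py"]
    for name, value in flags.items():
        if value is None or value is False:
            continue
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))
    return argv


# shlex.join would quote a literal "~" and block tilde expansion, so expand first.
home = os.path.expanduser
single_gpu = {
    "data_dir": home("~/limb/datasets/vial_rollout_v1_v21"),
    "checkpoint_dir": "checkpoints/value_model/yam_vial_v1",
    "batch_size": 4,
    "num_train_steps": 5000,
    "peak_lr": None,  # keep the default
    "load_pretrained": True,
    "tokenizer_path": home("~/vlm_ckpt/tokenizer.model"),
    "wandb_mode": "disabled",
}
print(shlex.join(build_command(single_gpu)))
```

The printed line can be pasted into a shell, or the argv list passed to `subprocess.run` with the XLA environment variables set in `env`.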

What the saved checkpoint contains

The orbax tree mirrors the bundle’s structure:

checkpoints/value_model/yam_vial_v1/step_00005000/
├── _CHECKPOINT_METADATA
├── _METADATA
├── array_metadatas/process_0
├── d/...
├── manifest.ocdbt
├── ocdbt.process_0/
└── _sharding

Top-level keys inside: {params, ema_params, step}. Stage 5 uses ema_params by default (--use_ema).
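Before handing a checkpoint to Stage 5, a quick structural check can catch an interrupted save. The sketch below uses only the standard library and does not read the arrays. `latest_step_dir` and `missing_entries` are hypothetical helpers, and the expected entries are simply the ones shown in the tree above, which is an assumption about this setup, not an orbax guarantee.

```python
import re
from pathlib import Path

# Entries shown in the tree above; treated here as the expected layout (assumption).
EXPECTED_ENTRIES = (
    "_CHECKPOINT_METADATA",
    "_METADATA",
    "array_metadatas",
    "d",
    "manifest.ocdbt",
    "ocdbt.process_0",
    "_sharding",
)
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step_dir(checkpoint_dir):
    """Return the step_* directory with the highest step number, or None."""
    candidates = []
    for child in Path(checkpoint_dir).expanduser().iterdir():
        match = STEP_DIR.match(child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def missing_entries(step_dir):
    """List expected entries that are absent from one checkpoint directory."""
    step_dir = Path(step_dir)
    return [name for name in EXPECTED_ENTRIES if not (step_dir / name).exists()]
```

For example, `missing_entries(latest_step_dir("checkpoints/value_model/yam_vial_v1"))` returning an empty list means the newest checkpoint has the same layout as the one listed here.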

Healthy training signals

  • Train loss falls from ~5–7 initial to ~2–3 within a few hundred steps.

  • Val loss (if provided) tracks train loss within ~2×.

  • Two-hot target distribution stays bimodal — non-zero mass at both ends of the [-1, 0] support. If everything collapses to one bin, your dataset has only success-shaped or only failure-shaped value labels; collect the missing class.
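To make the two-hot signal concrete, here is a small pure-Python sketch of two-hot encoding over a 201-bin support on [-1, 0]. It assumes uniformly spaced bins and linear interpolation between the two neighbouring bins; pistar's actual binning and loss code may differ, and all names below are illustrative.

```python
import math

NUM_BINS = 201
V_MIN, V_MAX = -1.0, 0.0


def two_hot(value):
    """Spread a scalar in [V_MIN, V_MAX] over its two neighbouring bins."""
    value = min(max(value, V_MIN), V_MAX)  # clamp out-of-range labels
    pos = (value - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)
    lo = min(int(math.floor(pos)), NUM_BINS - 2)
    w_hi = pos - lo
    target = [0.0] * NUM_BINS
    target[lo] = 1.0 - w_hi
    target[lo + 1] = w_hi
    return target


def expected_value(probs):
    """Decode a distribution over bins back to a scalar value."""
    step = (V_MAX - V_MIN) / (NUM_BINS - 1)
    return sum(p * (V_MIN + i * step) for i, p in enumerate(probs))


def cross_entropy(target, probs):
    """Cross-entropy of predicted probs against a (two-hot) target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


uniform = [1.0 / NUM_BINS] * NUM_BINS
print(round(cross_entropy(two_hot(-0.3), uniform), 3))  # ln(201), about 5.303
```

Under these assumptions a head that predicts a uniform distribution scores ln(201) ≈ 5.30, which sits at the low end of the ~5–7 initial loss quoted above. A label of 0 puts all mass in the last bin and a label of -1 in the first, so a dataset containing both kinds of labels produces the two-ended target distribution described in the last bullet.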

Common failures

| Symptom | Diagnosis & fix |
| --- | --- |
| `ImportError: cannot import name 'ValueModelWeightLoader'` | Patch 1 not applied. Add the class to weight_loaders.py. See patches.md. |
| `ModuleNotFoundError: No module named 'gemma.gm.data'` | Patch 2 not applied. Copy the upstream gemma gm/data/ directory in. |
| `ModuleNotFoundError: No module named 'kauldron.ktyping'` | Patch 3: use sed to rewrite kauldron.ktyping → kauldron.typing in the copied _functional.py and _transforms.py. |
| `AttributeError: module 'etils.edc' has no attribute 'ContextStack'` | Patch 5: replace the use with a local fallback class in _dtype_params.py. |
| `ImportError: cannot import name 'console' from 'openpi.shared'` | Patch 6: drop in the console.py stub. |
| `TypeError: DataConfig.__init__() got an unexpected keyword argument 'local_data_dir'` | Patch 8: build_value_data_config uses the new repo_id API. |
| `AttributeError: module 'openpi.training.data_loader' has no attribute 'create_value_data_loader'` | Patch 10: add the function. |
| `RESOURCE_EXHAUSTED: Out of memory while trying to allocate ...` | Another process is probably holding GPU memory; check with nvidia-smi. On multi-GPU rigs, use accelerate launch ... --num_processes=N and divide the batch size accordingly. |
| Loss is NaN after a few steps | Lower --batch_size and/or --peak_lr. Verify that value_label / reward_label contain no inf values (they should be in [-1, 0]). |
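Several of the import-related rows above can be probed in one pass instead of being discovered one traceback at a time. The sketch below is a hypothetical helper, not part of pistar. The module and attribute names come straight from the error messages in the table, and it is meant to be run inside the pistar virtualenv from the repository root.

```python
import importlib

# (module, attribute or None, patch label) taken from the table above.
CHECKS = [
    ("gemma.gm.data", None, "Patch 2"),
    ("openpi.shared.console", None, "Patch 6"),
    ("openpi.training.data_loader", "create_value_data_loader", "Patch 10"),
]


def probe(checks=CHECKS):
    """Return (patch label, reason) pairs for every check that fails."""
    failures = []
    for module_name, attr, patch in checks:
        try:
            module = importlib.import_module(module_name)
        except Exception as exc:  # any import-time failure counts as a miss
            failures.append((patch, f"{type(exc).__name__}: {exc}"))
            continue
        if attr is not None and not hasattr(module, attr):
            failures.append((patch, f"{module_name} has no attribute {attr!r}"))
    return failures


if __name__ == "__main__":
    for patch, reason in probe():
        print(f"{patch}: {reason}")
```

Read the reason string, not just the patch label: a module can fail to import because of a different missing patch further down its import chain (for example the kauldron.ktyping rename), in which case the label on that row is only a starting point.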

Next

The value model is ready — Stage 5 uses it to relabel adv_ind on autonomous frames.