Stage 4 — VLM value model training (pistar, patched)
Train the SigLIP-So400m + Gemma3-270M + 201-bin C51 critic head on
per-frame value_label supervision. Output: a value model that
predicts V(o_t) from (image, wrist_image, state, prompt).
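As a rough illustration of how a categorical critic head yields a scalar, the sketch below reads V(o_t) out as the expectation over bin centers. It assumes 201 evenly spaced bins over the [-1, 0] support mentioned later on this page; the function names are hypothetical and this is not pistar's actual code.

```python
import math

NUM_BINS = 201            # from the critic head described above
V_MIN, V_MAX = -1.0, 0.0  # assumed support with evenly spaced bins


def bin_centers(num_bins=NUM_BINS, v_min=V_MIN, v_max=V_MAX):
    """Evenly spaced bin centers from v_min to v_max inclusive."""
    step = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * step for i in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits):
    """Scalar value estimate: sum of probability times bin center."""
    probs = softmax(logits)
    centers = bin_centers(len(logits))
    return sum(p * c for p, c in zip(probs, centers))


# Uniform logits put equal mass on every bin, so the estimate is the midpoint.
print(expected_value([0.0] * NUM_BINS))  # approximately -0.5
```

Under this reading, a distribution concentrated near the right edge of the support gives a value close to 0, and one concentrated near the left edge gives a value close to -1.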
Warning
Pistar’s scripts/train_value.py is broken on upstream main.
It imports a ValueModelWeightLoader that doesn’t exist, depends on a
gemma/gm/data/ directory that isn’t shipped, and references modules
that have since been renamed in current kauldron / etils releases,
among other issues. Patches 1-13, documented in full at
the patches reference, resolve all of this. This page assumes those patches
are in place. (Stage 5 needs two more patches:
14 and 15.)
Required inputs
The v2.1 dataset with the five RECAP columns and the lerobot-cache symlink (local/<dataset> resolution).
The VLM checkpoint bundle at ~/Downloads/vlm_ckpt/ (or $OPENPI_VLM_CKPT_DIR).
The 13 pistar patches from patches.md applied.
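A small preflight check can catch missing paths before a run starts. The sketch below is a hypothetical helper, not part of pistar. It only checks that the dataset directory and the checkpoint bundle exist, and it assumes tokenizer.model sits at the bundle root, as in the commands on this page. It does not verify the RECAP columns or the patches.

```python
import os
from pathlib import Path

# Fallback location taken from the list above; the env var overrides it.
DEFAULT_VLM_CKPT_DIR = "~/Downloads/vlm_ckpt"


def check_inputs(data_dir, vlm_ckpt_dir=None):
    """Return a list of problems; an empty list means the paths look present."""
    problems = []

    data = Path(data_dir).expanduser()
    if not data.is_dir():
        problems.append(f"dataset directory not found: {data}")

    ckpt_root = vlm_ckpt_dir or os.environ.get(
        "OPENPI_VLM_CKPT_DIR", DEFAULT_VLM_CKPT_DIR
    )
    ckpt = Path(ckpt_root).expanduser()
    if not ckpt.is_dir():
        problems.append(f"VLM checkpoint bundle not found: {ckpt}")
    elif not (ckpt / "tokenizer.model").is_file():
        # Assumption: the tokenizer lives at the bundle root.
        problems.append(f"tokenizer.model missing under: {ckpt}")

    return problems


if __name__ == "__main__":
    for problem in check_inputs("~/limb/datasets/vial_rollout_v1_v21"):
        print("MISSING:", problem)
```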
Quick smoke test (5 steps, ~30 s)
Confirm the patched pipeline runs end-to-end before committing to a long training run.
cd ~/limb/pistar
source ~/.venvs/pistar/bin/activate
XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
python scripts/train_value.py \
--data_dir ~/limb/datasets/vial_rollout_v1_v21 \
--checkpoint_dir checkpoints/value_model/yam_vial_v1 \
--batch_size 4 --num_train_steps 5 \
--save_interval 100 --val_interval 0 \
--load_pretrained \
--tokenizer_path ~/vlm_ckpt/tokenizer.model \
--wandb_mode disabled
A successful 5-step run prints (Chinese log strings are pistar upstream; English commentary added):
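ℹ 使用本地 Gemma3 tokenizer: ~/vlm_ckpt/tokenizer.model ← using the local Gemma3 tokenizer
local_batch_size: 4
ℹ 数据集大小: 21286 帧 ← dataset size: 21286 frames (dataset reachable)
ℹ 加载 SigLIP + Gemma3-270M 预训练权重... ← loading SigLIP + Gemma3-270M pretrained weights
Restoring checkpoint from ~/vlm_ckpt/gemma-3-270m/step_00020000.
Finished restoring checkpoint in 1.33 seconds.
ValueModelWeightLoader: restored 241 leaf arrays from .../step_00020000 (key=params, step=20000)
✓ 预训练权重加载完成 ← pretrained weights loaded
模型初始化完成 ← model initialization complete
预取前几个batch以优化GPU利用率... ← prefetching the first few batches to improve GPU utilization
✓ 成功预取 3 个batch ← prefetched 3 batches
JIT编译预热... ← JIT compilation warm-up
JIT编译完成,开始训练... ← JIT done, training starts
Progress on:训练进度 1.00it/5.00it rate:12.1s/it ← training progress; first step (compile)
Progress on:训练进度 5.00it/5.00it rate:1.9s/it elapsed:00:13 ← steady-state ~0.2 s/step
✓ 保存 checkpoint: .../checkpoints/value_model/yam_vial_v1/step_00000005 ← checkpoint saved
训练完成! ← training complete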
The 5-step checkpoint is 5.1 GB on disk (full value model — SigLIP + Gemma3 + heads + EMA + step).
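The reported rate of 1.9 s/it on the last progress line still includes the slow compile step, which is why it differs from the ~0.2 s/step steady-state figure. One way to recover the steady-state number is to subtract the first step from the total elapsed time, as in this sketch. The helper name is hypothetical, and because the log's elapsed time is rounded to whole seconds, the result is only a rough estimate.

```python
def steady_state_seconds_per_step(total_elapsed_s, first_step_s, num_steps):
    """Average time per step after excluding the first (compile) step."""
    if num_steps < 2:
        raise ValueError("need at least two steps to exclude the first one")
    return (total_elapsed_s - first_step_s) / (num_steps - 1)


# Numbers from the smoke-test log: 13 s elapsed, 12.1 s first step, 5 steps.
print(round(steady_state_seconds_per_step(13.0, 12.1, 5), 3))  # 0.225
```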
Real training run
On the reference dataset (10 episodes, 21,286 frames), roughly 5,000 steps
is a reasonable budget. The bundle is already at step 20,000 from a prior
LIBERO run, so this is fine-tuning rather than pretraining.
XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
python scripts/train_value.py \
--data_dir ~/limb/datasets/vial_rollout_v1_v21 \
--checkpoint_dir checkpoints/value_model/yam_vial_v1 \
--batch_size 4 \
--num_train_steps 5000 \
--save_interval 1000 \
--val_interval 0 \
--load_pretrained \
--tokenizer_path ~/vlm_ckpt/tokenizer.model \
--wandb_mode disabled
At ~0.2 s/step, that is about 17 minutes of wall-clock time. Checkpoints are
written at steps 1k, 2k, 3k, 4k, and 5k under checkpoints/value_model/yam_vial_v1/step_*.
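The arithmetic behind that estimate, and the checkpoint directories to expect, can be sketched as follows. The helper is hypothetical. The 8-digit step_ naming follows the paths shown in the smoke-test log, and the final step is always included because the smoke test saved step 5 even with a save interval of 100.

```python
def estimate_run(num_train_steps, save_interval, seconds_per_step):
    """Return (estimated minutes, expected checkpoint directory names)."""
    minutes = num_train_steps * seconds_per_step / 60.0

    steps = list(range(save_interval, num_train_steps + 1, save_interval))
    if not steps or steps[-1] != num_train_steps:
        # The smoke test wrote a checkpoint at the final step regardless
        # of the interval, so include it here as well.
        steps.append(num_train_steps)

    names = [f"step_{s:08d}" for s in steps]
    return minutes, names


minutes, names = estimate_run(5000, 1000, 0.2)
print(f"{minutes:.1f} min")  # 16.7 min
print(names[0], names[-1])   # step_00001000 step_00005000
```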
Multi-GPU full-throttle on 8× H100:
accelerate launch --multi_gpu --num_processes=8 --mixed_precision=bf16 \
$(which python) scripts/train_value.py \
--data_dir <…> --checkpoint_dir <…> \
--batch_size 64 --num_train_steps 30000 \
--load_pretrained --tokenizer_path <…>/tokenizer.model
That matches pistar’s documented default (30k steps, batch 64) — paper scale.
Tuning knobs
| Flag | Default | Notes |
|---|---|---|
| | 32 | Drop to 4–8 on a single 24 GB consumer GPU; raise to 64+ on H100s. |
| | 30000 | The bundle is already at step 20k; 5k more is plenty for small-task fine-tuning. |
| | 2.5e-5 | Drop to 1e-5 if loss diverges; pistar default schedule is cosine. |
| | off | Required. Invokes our |
| | (auto) | Explicit path defeats pistar’s hardcoded |
| | | Default freezes SigLIP + LLM. |
| | online | Set to |
| | off | Provide a held-out v2.1 dataset for periodic val loss; very useful at paper scale. |
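For intuition about the cosine schedule mentioned in the table, here is a minimal sketch of cosine decay from the 2.5e-5 default. It assumes no warmup and a final learning rate of zero. Neither detail is stated on this page, so pistar's actual schedule may differ.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2.5e-5, end_lr=0.0):
    """Cosine decay from peak_lr at step 0 to end_lr at total_steps."""
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + (peak_lr - end_lr) * cosine


for step in (0, 2500, 5000):
    print(step, cosine_lr(step, total_steps=5000))
```

With this shape, the learning rate is at half its peak at the midpoint of the run, so lowering the peak to 1e-5 scales the whole curve down proportionally.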
What the saved checkpoint contains
The orbax tree mirrors the bundle’s structure:
checkpoints/value_model/yam_vial_v1/step_00005000/
├── _CHECKPOINT_METADATA
├── _METADATA
├── array_metadatas/process_0
├── d/...
├── manifest.ocdbt
├── ocdbt.process_0/
└── _sharding
Top-level keys inside: {params, ema_params, step}. Stage 5
uses ema_params by default (--use_ema).
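To make the relationship between params and ema_params concrete, the sketch below shows a generic exponential-moving-average update and the selection a --use_ema style switch implies. The decay value, the flat dict of floats, and both function names are illustrative assumptions, not pistar's implementation.

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step: blend the running average toward the current params."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * params[name]
        for name in params
    }


def select_params(checkpoint, use_ema=True):
    """Pick which parameter tree to evaluate with."""
    return checkpoint["ema_params"] if use_ema else checkpoint["params"]


# Toy checkpoint with the three top-level keys listed above.
checkpoint = {"params": {"w": 0.0}, "ema_params": {"w": 1.0}, "step": 5000}
checkpoint["ema_params"] = ema_update(
    checkpoint["ema_params"], checkpoint["params"], decay=0.9
)
print(select_params(checkpoint)["w"])  # 0.9
```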
Healthy training signals
Train loss falls from ~5–7 initial to ~2–3 within a few hundred steps.
Val loss (if provided) tracks train loss within ~2×.
Two-hot target distribution stays bimodal — non-zero mass at both ends of the [-1, 0] support. If everything collapses to one bin, your dataset has only success-shaped or only failure-shaped value labels; collect the missing class.
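The bimodality check above can be approximated offline from the value labels alone. The sketch below encodes each label as a two-hot target over 201 evenly spaced bins on [-1, 0] and reports the average mass near each end. The bin layout, the edge_bins threshold, and the function names are assumptions for illustration, not pistar's code.

```python
def two_hot(value, num_bins=201, v_min=-1.0, v_max=0.0):
    """Split unit mass between the two bins that bracket `value`."""
    value = min(max(value, v_min), v_max)           # clamp to the support
    pos = (value - v_min) / (v_max - v_min) * (num_bins - 1)
    lo = int(pos)                                   # floor, since pos >= 0
    hi = min(lo + 1, num_bins - 1)
    w_hi = pos - lo
    target = [0.0] * num_bins
    target[lo] += 1.0 - w_hi
    target[hi] += w_hi
    return target


def end_mass(labels, num_bins=201, edge_bins=10):
    """Average target mass in the lowest and highest `edge_bins` bins."""
    total = [0.0] * num_bins
    for value in labels:
        for i, p in enumerate(two_hot(value, num_bins)):
            total[i] += p
    n = len(labels)
    return sum(total[:edge_bins]) / n, sum(total[-edge_bins:]) / n


low, high = end_mass([-1.0, -0.98, 0.0, 0.0])
print(low, high)  # both non-zero here, so both label classes are present
```

If either number is zero across the whole dataset, the labels cover only one end of the support, which matches the collapse described above.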
Common failures
| Symptom | Diagnosis & fix |
|---|---|
| | Patch 1 not applied: add the class to |
| | Patch 2 not applied: copy the upstream gemma |
| | Patch 3: sed |
| | Patch 5: replace the use with a local fallback class in |
| | Patch 6: drop in the |
| | Patch 8: |
| | Patch 10: add the function. |
| | Another GPU consumer is up. |
| Loss is NaN after a few steps | Drop |
Next
The value model is ready — Stage 5 uses it to
relabel adv_ind on autonomous frames.