# Stage 4 — VLM value model training (pistar, patched)

Train the value model (SigLIP-So400m vision encoder, Gemma3-270M
language model, and a 201-bin C51 critic head) on per-frame
`value_label` supervision. Output: a value model that predicts
`V(o_t)` from `(image, wrist_image, state, prompt)`.

```{warning}
**Pistar's `scripts/train_value.py` is upstream-broken on `main`.**
It imports a `ValueModelWeightLoader` that doesn't exist, depends on a
`gemma/gm/data/` directory that isn't shipped, references modules
renamed in modern `kauldron` / `etils`, and so on. We resolved all of
this with **patches 1-13** documented in full at
[the patches reference](patches.md). This page assumes those patches
are in place. ([Stage 5](stage5_advantage.md) needs 2 more patches —
14 and 15.)
```

## Required inputs

1. The [v2.1 dataset](stage1_conversion.md) with the five RECAP columns
   *and* the lerobot-cache symlink (`local/<dataset>` resolution).
2. The [VLM checkpoint bundle](setup.md#vlm-checkpoint-for-stage-4)
   at `~/Downloads/vlm_ckpt/` (or `$OPENPI_VLM_CKPT_DIR`).
3. The 13 pistar patches from [patches.md](patches.md) applied.
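
A quick pre-flight check can catch a missing dataset or bundle before the JIT warm-up does. The sketch below is not part of pistar: `check_stage4_inputs` is a hypothetical helper, and the assumption that `tokenizer.model` sits at the bundle root is inferred from the commands on this page. It does not verify the RECAP columns or the patches.

```python
import os
from pathlib import Path


def check_stage4_inputs(data_dir, vlm_ckpt_dir=None):
    """Return a list of problems; an empty list means the inputs look present."""
    problems = []

    data_dir = Path(data_dir).expanduser()
    if not data_dir.is_dir():
        problems.append(f"dataset directory not found: {data_dir}")

    # Bundle location: explicit argument first, then $OPENPI_VLM_CKPT_DIR,
    # then the default path named in the list above.
    bundle = Path(
        vlm_ckpt_dir
        or os.environ.get("OPENPI_VLM_CKPT_DIR")
        or "~/Downloads/vlm_ckpt"
    ).expanduser()

    if not bundle.is_dir():
        problems.append(f"VLM checkpoint bundle not found: {bundle}")
    elif not (bundle / "tokenizer.model").is_file():
        # Assumption: the tokenizer lives at the bundle root.
        problems.append(f"tokenizer.model missing under: {bundle}")

    return problems
```

Call it with the same dataset path you plan to pass as `--data_dir`, and pass the bundle path explicitly if yours is not at the default location; any returned string names the input to fix.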

## Quick smoke test (5 steps, ~30 s)

Confirm the patched pipeline runs end-to-end before committing to a long
training run.

```bash
cd ~/limb/pistar
source ~/.venvs/pistar/bin/activate

XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
  python scripts/train_value.py \
    --data_dir ~/limb/datasets/vial_rollout_v1_v21 \
    --checkpoint_dir checkpoints/value_model/yam_vial_v1 \
    --batch_size 4 --num_train_steps 5 \
    --save_interval 100 --val_interval 0 \
    --load_pretrained \
    --tokenizer_path ~/vlm_ckpt/tokenizer.model \
    --wandb_mode disabled
```

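A successful 5-step run prints the following. The Chinese log strings
come from upstream pistar; the `←` annotations are English glosses
added here, not program output.

```text
ℹ 使用本地 Gemma3 tokenizer: ~/vlm_ckpt/tokenizer.model   ← using the local Gemma3 tokenizer
local_batch_size: 4
ℹ 数据集大小: 21286 帧                                  ← dataset size: 21286 frames (dataset reachable)
ℹ 加载 SigLIP + Gemma3-270M 预训练权重...               ← loading pretrained SigLIP + Gemma3-270M weights
Restoring checkpoint from ~/vlm_ckpt/gemma-3-270m/step_00020000.
Finished restoring checkpoint in 1.33 seconds.
ValueModelWeightLoader: restored 241 leaf arrays from .../step_00020000 (key=params, step=20000)
✓ 预训练权重加载完成                                     ← pretrained weights loaded
模型初始化完成                                           ← model initialized
预取前几个batch以优化GPU利用率...                        ← prefetching the first few batches to improve GPU utilization
✓ 成功预取 3 个batch                                     ← prefetched 3 batches
JIT编译预热...                                           ← JIT compilation warm-up
JIT编译完成，开始训练...                                  ← JIT done, training starts
Progress on:训练进度 1.00it/5.00it rate:12.1s/it           ← "训练进度" = training progress; first step (compile)
Progress on:训练进度 5.00it/5.00it rate:1.9s/it elapsed:00:13   ← steady-state ~0.2 s/step (the last 4 steps took ~1 s of the 13 s)
✓ 保存 checkpoint: .../checkpoints/value_model/yam_vial_v1/step_00000005   ← checkpoint saved
训练完成!                                                ← training complete
```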

The 5-step checkpoint is 5.1 GB on disk (full value model — SigLIP +
Gemma3 + heads + EMA + step).

## Real training run

On the reference dataset (10 episodes, 21,286 frames), about 5,000
steps is a sensible scale. The bundle is already at step 20,000 from a
prior LIBERO run, so this is fine-tuning, not pretraining.

```bash
XLA_PYTHON_CLIENT_PREALLOCATE=false XLA_PYTHON_CLIENT_MEM_FRACTION=0.85 \
  python scripts/train_value.py \
    --data_dir ~/limb/datasets/vial_rollout_v1_v21 \
    --checkpoint_dir checkpoints/value_model/yam_vial_v1 \
    --batch_size 4 \
    --num_train_steps 5000 \
    --save_interval 1000 \
    --val_interval 0 \
    --load_pretrained \
    --tokenizer_path ~/vlm_ckpt/tokenizer.model \
    --wandb_mode disabled
```

At ~0.2 s/step, that is ~17 minutes of wall-clock time. Checkpoints
land at steps 1k, 2k, 3k, 4k, and 5k under
`checkpoints/value_model/yam_vial_v1/step_*`.
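
The arithmetic behind that estimate can be sketched as follows. `estimate_run` is a hypothetical helper, not a pistar function; the 12.1 s first step and 0.2 s steady-state rate come from the smoke-test log, and the rule "save every `save_interval` steps plus once at the final step" is an assumption inferred from that log, which saved at step 5 with `--save_interval 100`.

```python
def estimate_run(num_steps, save_interval, sec_per_step=0.2, first_step_sec=12.1):
    """Estimate wall-clock seconds and the step_* directories a run should write."""
    # The first step pays the JIT compile cost; the rest run at steady state.
    seconds = first_step_sec + (num_steps - 1) * sec_per_step

    # Assumption: one checkpoint per save_interval, plus one at the last step.
    steps = list(range(save_interval, num_steps + 1, save_interval))
    if not steps or steps[-1] != num_steps:
        steps.append(num_steps)

    # Directory names follow the zero-padded pattern seen in the log.
    names = [f"step_{step:08d}" for step in steps]
    return seconds, names


smoke_seconds, smoke_dirs = estimate_run(num_steps=5, save_interval=100)
real_seconds, real_dirs = estimate_run(num_steps=5000, save_interval=1000)
```

With the values above, the smoke test comes out at about 13 s, in line with the logged `elapsed:00:13`, and the 5,000-step run at roughly 1,012 s, or just under 17 minutes. Substitute your own measured step time if your GPU differs.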

Multi-GPU full-throttle on 8× H100:

```bash
accelerate launch --multi_gpu --num_processes=8 --mixed_precision=bf16 \
  scripts/train_value.py \
    --data_dir <…> --checkpoint_dir <…> \
    --batch_size 64 --num_train_steps 30000 \
    --load_pretrained --tokenizer_path <…>/tokenizer.model
```

That matches pistar's documented default (30k steps, batch 64) — paper
scale.

## Tuning knobs

| Flag                          | Default | Notes                                                                              |
|-------------------------------|---------|------------------------------------------------------------------------------------|
| `--batch_size`                | 32      | Drop to 4–8 on a single 24 GB consumer GPU; raise to 64+ on H100s.                  |
| `--num_train_steps`           | 30000   | The bundle is already at step 20k; 5k more is plenty for small-task fine-tuning.   |
| `--peak_lr`                   | 2.5e-5  | Drop to 1e-5 if loss diverges; pistar default schedule is cosine.                   |
| `--load_pretrained`           | off     | **Required.** Invokes our `ValueModelWeightLoader` against the VLM bundle.         |
| `--tokenizer_path`            | (auto)  | Explicit path defeats pistar's hardcoded `/data/...` fallback search.              |
| `--freeze_mode`               | `all_backbones` | Default freezes SigLIP + LLM. `siglip_only` and `none` are slower but lower-bias. |
| `--wandb_mode`                | online  | Set to `disabled` for the first dry runs.                                          |
| `--val_data_dir` + `--val_interval` | off | Provide a held-out v2.1 dataset for periodic val loss; very useful at paper scale. |

## What the saved checkpoint contains

The Orbax checkpoint directory mirrors the bundle's structure:

```text
checkpoints/value_model/yam_vial_v1/step_00005000/
├── _CHECKPOINT_METADATA
├── _METADATA
├── array_metadatas/process_0
├── d/...
├── manifest.ocdbt
├── ocdbt.process_0/
└── _sharding
```

Top-level keys inside: `{params, ema_params, step}`. [Stage 5](stage5_advantage.md)
uses `ema_params` by default (`--use_ema`).
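
To pick up the newest checkpoint from a script, one option is to scan for `step_*` directories. This is a minimal sketch using only the standard library; `latest_checkpoint` and `looks_complete` are hypothetical helpers, and the completeness check is a heuristic based on the file names in the listing above, not an Orbax guarantee.

```python
import re
from pathlib import Path

_STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) for the highest-numbered step_* directory, or None."""
    root = Path(checkpoint_dir).expanduser()
    if not root.is_dir():
        return None
    found = []
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    return max(found, key=lambda item: item[0]) if found else None


def looks_complete(step_dir):
    """Heuristic: the metadata and manifest entries from the listing are present."""
    required = ("_CHECKPOINT_METADATA", "_METADATA", "manifest.ocdbt")
    return all((Path(step_dir) / name).exists() for name in required)
```

For example, `latest_checkpoint("checkpoints/value_model/yam_vial_v1")` returns the step number and path of the newest directory, which you can pass through `looks_complete` before handing it to the next stage.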

## Healthy training signals

- **Train loss** falls from ~5–7 initial to ~2–3 within a few hundred
  steps.
- **Val loss** (if provided) tracks train loss within ~2×.
- **Two-hot target distribution** stays bimodal — non-zero mass at both
  ends of the [-1, 0] support. If everything collapses to one bin, your
  dataset has only success-shaped or only failure-shaped value labels;
  collect the missing class.
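
To make the two-hot target concrete, here is a minimal pure-Python sketch. It assumes the 201 bins are spaced uniformly over [-1, 0] (step 0.005); pistar's actual encoder may differ, and `two_hot`, `expected_value`, and `mean_target` are illustrative names, not pistar functions.

```python
import math

NUM_BINS = 201
V_MIN, V_MAX = -1.0, 0.0


def two_hot(value, num_bins=NUM_BINS, v_min=V_MIN, v_max=V_MAX):
    """Spread one scalar label over its two neighbouring bins."""
    value = min(max(value, v_min), v_max)  # clamp into the support
    pos = (value - v_min) * (num_bins - 1) / (v_max - v_min)  # fractional bin index
    lower = min(int(math.floor(pos)), num_bins - 2)
    frac = pos - lower
    probs = [0.0] * num_bins
    probs[lower] = 1.0 - frac
    probs[lower + 1] = frac
    return probs


def expected_value(probs, v_min=V_MIN, v_max=V_MAX):
    """Decode a distribution over the bins back to a scalar value."""
    step = (v_max - v_min) / (len(probs) - 1)
    return sum(p * (v_min + i * step) for i, p in enumerate(probs))


def mean_target(values):
    """Average the two-hot targets of many labels into one histogram."""
    total = [0.0] * NUM_BINS
    for value in values:
        for i, p in enumerate(two_hot(value)):
            total[i] += p
    return [t / len(values) for t in total]
```

A label of -0.5025 lands halfway between bins 99 and 100, and decoding the target recovers the label. Running `mean_target` over a sample of `value_label` entries gives the histogram to inspect: a healthy dataset shows mass near both bin 0 and bin 200.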

## Common failures

| Symptom                                                                 | Diagnosis & fix                                                                                                                                  |
|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| `ImportError: cannot import name 'ValueModelWeightLoader'`              | Patch 1 not applied — add the class to `weight_loaders.py`. See [patches.md](patches.md).                                                         |
| `ModuleNotFoundError: No module named 'gemma.gm.data'`                  | Patch 2 not applied — copy the upstream gemma `gm/data/` directory in.                                                                          |
| `ModuleNotFoundError: No module named 'kauldron.ktyping'`               | Patch 3 — sed `kauldron.ktyping` → `kauldron.typing` in the copied `_functional.py` and `_transforms.py`.                                          |
| `AttributeError: module 'etils.edc' has no attribute 'ContextStack'`    | Patch 5 — replace the use with a local fallback class in `_dtype_params.py`.                                                                     |
| `ImportError: cannot import name 'console' from 'openpi.shared'`        | Patch 6 — drop in the `console.py` stub.                                                                                                          |
| `TypeError: DataConfig.__init__() got an unexpected keyword argument 'local_data_dir'` | Patch 8 — `build_value_data_config` uses the new `repo_id` API.                                                                                  |
| `AttributeError: module 'openpi.training.data_loader' has no attribute 'create_value_data_loader'` | Patch 10 — add the function.                                                                                                                     |
| `RESOURCE_EXHAUSTED: Out of memory while trying to allocate ...`        | Another process is probably holding GPU memory; check with `nvidia-smi`. Otherwise lower `--batch_size`. On multi-GPU rigs, use `accelerate launch ... --num_processes=N` and divide the batch size accordingly. |
| Loss is NaN after a few steps                                           | Drop `--batch_size` and/or `--peak_lr`. Verify `value_label` / `reward_label` aren't `inf` (they should be in `[-1, 0]`).                          |
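
The label check in the last row can be sketched as below. `audit_labels` is a hypothetical helper; it assumes you have already loaded the `value_label` or `reward_label` column into a plain sequence of numbers, since how you read the dataset depends on your tooling.

```python
import math


def audit_labels(values, lo=-1.0, hi=0.0):
    """Count NaN, infinite, and out-of-range entries in a sequence of labels."""
    report = {"total": 0, "nan": 0, "inf": 0, "out_of_range": 0}
    for raw in values:
        value = float(raw)
        report["total"] += 1
        if math.isnan(value):
            report["nan"] += 1
        elif math.isinf(value):
            report["inf"] += 1
        elif not (lo <= value <= hi):
            report["out_of_range"] += 1
    return report
```

Any non-zero count besides `total` points at the data rather than the optimizer, so fix the labels before lowering `--peak_lr`.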

## Next

The value model is ready — [Stage 5](stage5_advantage.md) uses it to
relabel `adv_ind` on autonomous frames.