Evaluation — Serve + deploy via limb

Serve the trained pi0.6 checkpoint with openpi’s serve_policy.py and drive YAM through limb’s existing DAgger client. Because pistar’s adv_ind rides through the standard openpi tokenizer, no CFG-sampler shim is required: the same serve_policy.py that serves a Stage 2 SFT checkpoint also serves a Stage 6 RECAP checkpoint.

Serve

cd ~/limb/pistar
source ~/.venvs/pistar/bin/activate

# Stage 6 (full fine-tune):
python scripts/serve_policy.py --port=8111 policy:checkpoint \
  --policy.config=pi06_yam_vial_30fps_from_sft_infer \
  --policy.dir=checkpoints/pi06_yam_vial_30fps_from_sft/stage6_v1/<step>

# Stage 3 LoRA-from-SFT smoke run:
python scripts/serve_policy.py --port=8111 policy:checkpoint \
  --policy.config=pi06_yam_vial_30fps_lora_from_sft_infer \
  --policy.dir=checkpoints/pi06_yam_vial_30fps_lora_from_sft/stage3_v0/<step>

Warning

The _infer suffix matters. The infer-variant TrainConfig has adv_ind_dropout=False so the positive tag is always present at inference. Using the non-infer variant at serving time means the tokenizer randomly drops adv_ind 90% of the time and you silently lose the RECAP conditioning.

Match the model variant

LoRA training → serve through the matching _lora_*_infer config. The LoRA param tree differs from the full-fine-tune tree, so restoring a checkpoint with the other variant’s config fails.
Full fine-tune → serve through the matching _from_sft_infer config.
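
The two rules above (serve an `_infer` config, and match the LoRA or full-fine-tune variant) can be checked before launching the server. The sketch below is a hypothetical helper, not part of pistar or openpi. It assumes the naming convention visible in the two serve commands, where the checkpoint path contains the config name without the `_infer` suffix as one of its directory components.

```python
from pathlib import PurePosixPath


def check_serve_args(config_name: str, checkpoint_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the pair looks consistent.

    Hypothetical pre-flight helper. It relies only on the naming convention
    seen in the serve commands (an assumption, not a pistar guarantee).
    """
    problems = []
    if not config_name.endswith("_infer"):
        problems.append("config is not an _infer variant; adv_ind may be dropped")
    # The training config name is the serving name minus the _infer suffix.
    train_name = config_name.removesuffix("_infer")
    # Require an exact directory-component match so that the LoRA and
    # full-fine-tune names cannot be confused by a substring test.
    if train_name not in PurePosixPath(checkpoint_dir).parts:
        problems.append(f"checkpoint path has no '{train_name}' component")
    return problems
```

Calling it with the full-fine-tune `_infer` config and a LoRA checkpoint directory returns one problem, which is the mismatch that would otherwise surface as a restore failure at load time.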

limb side — add `adv_ind: "positive"` to the obs transform

For pistar / pi0.6 RECAP checkpoints, limb’s OpenPIObsTransform must emit adv_ind on every wire observation. Without it, the server’s TokenizePrompt raises `ValueError: Adv_ind is required.` (see transforms.py:266 in pistar). At serving time the value is always "positive".

# configs/yam_pi0_bimanual.yaml  (or your dagger config)
agent:
  client:
    _target_: limb.agents.policy_learning.policy_client.OpenPIClient
    host: "0.0.0.0"
    port: 8111
  obs_transform:
    _target_: limb.agents.policy_learning.transforms.OpenPIObsTransform
    prompt: "Use one arm to grasp the papercup and hand it over to the other arm"
    # ⬇ NEW — required for any pi06_yam_vial_30fps[_lora][_from_sft][_recap]_infer
    #         config. Omit (or leave None) when serving vanilla pi0/pi0.5.
    adv_ind: "positive"
    image_keys:
      cam_high: "head_camera-images-rgb"
      cam_left_wrist: "left_wrist_camera-images-rgb"
      cam_right_wrist: "right_wrist_camera-images-rgb"
    image_size: [224, 224]
    state_keys:
      - "left-joint_pos"
      - "left-gripper_pos"
      - "right-joint_pos"
      - "right-gripper_pos"

The adv_ind_dropout=False on the _infer TrainConfig only controls the server-side tokenizer’s randomization — it doesn’t conjure the field out of nowhere. The client still has to send it.
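
To make the client-side requirement concrete, the sketch below checks an observation dictionary before it goes on the wire. The function name and the nested dictionary layout are illustrative assumptions based on the field names used on this page; the actual output of limb's OpenPIObsTransform may be structured differently.

```python
REQUIRED_IMAGES = ("cam_high", "cam_left_wrist", "cam_right_wrist")


def missing_wire_fields(obs: dict, recap: bool = True) -> list[str]:
    """List the wire fields missing from `obs` (hypothetical layout).

    Assumes state, prompt and adv_ind at the top level and camera frames
    under obs["images"]. recap=False models vanilla pi0/pi0.5 serving,
    where adv_ind is not required.
    """
    missing = [key for key in ("state", "prompt") if key not in obs]
    images = obs.get("images", {})
    missing += [f"images.{key}" for key in REQUIRED_IMAGES if key not in images]
    # adv_ind must be present and non-None for pistar RECAP checkpoints.
    if recap and obs.get("adv_ind") is None:
        missing.append("adv_ind")
    return missing
```

A check like this fails fast on the client, instead of waiting for the server to reject the request.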

Launch deployment:

cd ~/limb
source .venv/bin/activate

# Pure autonomous serve (no recording)
uv run limb teleop --config-path configs/yam_dagger_pi0_bimanual.yaml

# Or: continue collecting rollouts with this checkpoint
uv run limb record --config-path \
  configs/yam_dagger_pi0_bimanual.yaml \
  configs/dagger_collection.yaml

What’s served vs what limb sends

Wire field	limb sends	What the pistar server does with it
`state`	YAM 14-D state (left + right joints + grippers)	padded to action_dim (32)
`images.cam_high`	resized head_camera (224×224, CHW)	tokenized by SigLIP
`images.cam_left_wrist`	resized left_wrist	tokenized by SigLIP
`images.cam_right_wrist`	resized right_wrist	tokenized by SigLIP
`prompt`	the task instruction string	concatenated with the `adv_ind` value into `"Task: <prompt>, State: <…>, Advantage: <adv_ind>;\nAction: "` and tokenized by the PaliGemma SentencePiece tokenizer
`adv_ind`	`"positive"` (pinned by `OpenPIObsTransform.adv_ind`) — required for pistar models	substituted into the `Advantage: …` clause above. `_infer` configs set `adv_ind_dropout=False`, so the clause is always present at inference.
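
The `prompt` row can be illustrated with a small formatting function. This is a sketch of the string template quoted in the table, not pistar's TokenizePrompt; `build_prompt` and its `state_str` parameter are hypothetical, and the rendering of the state into text is left to the caller because this page does not specify it.

```python
from typing import Optional


def build_prompt(prompt: str, state_str: str, adv_ind: Optional[str]) -> str:
    """Fill the prompt template quoted in the table above.

    Rejects a missing adv_ind, mirroring the server-side error described
    on this page. The state rendering is supplied by the caller.
    """
    if adv_ind is None:
        raise ValueError("Adv_ind is required.")
    return f"Task: {prompt}, State: {state_str}, Advantage: {adv_ind};\nAction: "
```

With `adv_ind="positive"` the `Advantage: positive` clause is always present, which is the condition the `_infer` configs are meant to guarantee.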

Quantitative evaluation

For honest comparisons between checkpoints (Stage 2 SFT vs Stage 3 vs Stage 6), use the same prompt, same scene, and same number of trials.

# 1. Start the server with the candidate checkpoint
python scripts/serve_policy.py ...

# 2. Run N=50 trials, recording success/failure
uv run limb record --config-path \
  configs/yam_dagger_pi0_bimanual.yaml \
  configs/dagger_collection.yaml
# (now operator just labels s/SPACE per attempt; don't intervene)

# 3. Compute success rate
uv run python -c "
import glob, os
run = 'recordings/<task>_<ts>'
# A success is an episode directory holding an empty SUCCESS marker file.
ok = sum(os.path.getsize(f) == 0 for f in glob.glob(f'{run}/episode_*/SUCCESS'))
total = len(glob.glob(f'{run}/episode_*'))
print(f'{ok}/{total} = {100*ok/total:.1f}%' if total else 'no episodes found')
"

Key invariant: an evaluation run is operator-passive. No bilateral teleop, no CORRECTING phase — just observe the policy and label s/SPACE. If you intervene, the resulting episode no longer reflects the policy’s autonomous performance.
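
When comparing checkpoints on N=50 trials, it can help to report an uncertainty range next to the raw rate, since differences of a few percentage points are within noise at that sample size. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. It is an illustration only, not part of limb or pistar, and `wilson_interval` is a hypothetical name.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(successes=25, trials=50)
print(f"25/50 = 50.0% (95% interval roughly {100 * lo:.1f}% to {100 * hi:.1f}%)")
```

If the intervals of two checkpoints overlap heavily, more trials are probably needed before declaring one better.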

Common deployment issues

Symptom	Likely cause
Robot moves identically to the pre-RECAP checkpoint	`_infer` config not used — `adv_ind` is dropping out. Re-serve with the `_*_infer` variant.
First request takes 30+ seconds	Normal — pistar JIT-compiles the forward pass on first call. Subsequent requests are ~150 ms.
Connection refused on `:8111`	Server not up. Check `nvidia-smi`; if pistar crashed on load, see the troubleshooting sections of the Stage 4 and Stage 6 pages.
Server returns shape mismatch	Config / checkpoint mismatch. A LoRA-trained checkpoint served with a full-fine-tune config (or vice versa) will fail to restore.
Robot jitters / oscillates	Action smoothing too tight. `action_horizon=50` + `smoothing_window=4` is the default; raise smoothing for jittery policies.
Policy stuck — won’t initiate task	Initial state too far from training distribution. Use right-pedal → CORRECTING to nudge into a known pose, then left-pedal back.
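
For the "Connection refused" row, a quick reachability check from the robot machine can separate a server that is down from a wrong host or port. This is a minimal sketch using only the Python standard library; `server_reachable` is a hypothetical helper, and it only tests that the TCP port accepts connections, not that the policy loaded correctly.

```python
import socket


def server_reachable(host: str, port: int = 8111, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Because the first request can take 30+ seconds while the forward pass compiles, an open port followed by a slow first response is expected and is not a sign of failure.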

Closed loop — feeding evaluation back into training

A successful evaluation run produces a new batch of episodes (s/SPACE-labeled). Treat that batch as a new iteration:

1. Run limb convert-lerobot --pistar (Stage 1) on the new session.
2. Merge into the existing v3.0 dataset (e.g. with pistar/scripts/merge_datasets.py).
3. Re-run convert_v3_to_v21.py.
4. Re-train Stage 4 → 5 → 6 on the bigger dataset.

This is the closed data loop that the pi0.6 paper iterates 2–3 times for hard tasks.

Next

For the maintainer’s reference, the upstream-pistar bugs we patched to make this all work end-to-end → patches reference.