# Evaluation: Serve + deploy via limb

Serve the trained pi0.6 checkpoint with openpi's `serve_policy.py` and drive YAM through limb's existing DAgger client. Because pistar's `adv_ind` rides through the **standard openpi tokenizer**, there is **no CFG-sampler shim required**: the same `serve_policy.py` that serves a Stage 2 SFT works for a Stage 6 RECAP checkpoint.

## Serve

```bash
cd ~/limb/pistar
source ~/.venvs/pistar/bin/activate

# Stage 6 (full fine-tune):
python scripts/serve_policy.py --port=8111 policy:checkpoint \
    --policy.config=pi06_yam_vial_30fps_from_sft_infer \
    --policy.dir=checkpoints/pi06_yam_vial_30fps_from_sft/stage6_v1/

# Stage 3 LoRA-from-SFT smoke run:
python scripts/serve_policy.py --port=8111 policy:checkpoint \
    --policy.config=pi06_yam_vial_30fps_lora_from_sft_infer \
    --policy.dir=checkpoints/pi06_yam_vial_30fps_lora_from_sft/stage3_v0/
```

```{warning}
The **`_infer` suffix matters**. The infer-variant TrainConfig has `adv_ind_dropout=False` so the positive tag is always present at inference. Using the non-infer variant at serving time means the tokenizer randomly drops `adv_ind` 90% of the time and you silently lose the RECAP conditioning.
```

### Match the model variant

- LoRA training → must serve through the matching `_lora_*_infer` config (the LoRA-variant param tree differs from the full-fine-tune tree; restoring one with the other's config fails).
- Full fine-tune → matching `_from_sft_infer` config.

## limb side: add `adv_ind: "positive"` to the obs transform

For pistar / pi0.6 RECAP checkpoints, limb's `OpenPIObsTransform` must emit `adv_ind` on every wire observation. Without it the server's `TokenizePrompt` raises `ValueError: Adv_ind is required.` (see [transforms.py:266](https://github.com/ybpy/pistar/blob/main/src/openpi/transforms.py#L266) in pistar). The token's value at serving time is fixed: `"positive"`.
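The requirement above can also be caught on the client before a request ever reaches the server. The following is a minimal sketch, not limb or pistar code: `check_wire_obs` and `REQUIRED_FIELDS` are hypothetical names, and the field list is assumed from the wire-field table later on this page.

```python
from typing import Any, Mapping

# Hypothetical pre-flight check; not part of limb or pistar.
# Field names are assumed from the wire-field table on this page.
REQUIRED_FIELDS = ("state", "images", "prompt", "adv_ind")


def check_wire_obs(obs: Mapping[str, Any]) -> None:
    """Raise ValueError if the observation lacks a field a RECAP checkpoint needs."""
    missing = [key for key in REQUIRED_FIELDS if obs.get(key) is None]
    if missing:
        raise ValueError(f"wire observation is missing: {missing}")
    # At serving time the advantage indicator is pinned to "positive".
    if obs["adv_ind"] != "positive":
        raise ValueError(f"adv_ind must be 'positive', got {obs['adv_ind']!r}")


# Example: a well-formed observation passes silently.
check_wire_obs(
    {"state": [0.0] * 14, "images": {}, "prompt": "hand over the cup", "adv_ind": "positive"}
)
```

A check like this only fails fast with a clearer message; the field itself still has to come from the obs transform configuration that follows.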
```yaml
# configs/yam_pi0_bimanual.yaml (or your dagger config)
agent:
  client:
    _target_: limb.agents.policy_learning.policy_client.OpenPIClient
    host: "0.0.0.0"
    port: 8111
    obs_transform:
      _target_: limb.agents.policy_learning.transforms.OpenPIObsTransform
      prompt: "Use one arm to grasp the papercup and hand it over to the other arm"
      # ⬇ NEW: required for any pi06_yam_vial_30fps[_lora][_from_sft][_recap]_infer
      # config. Omit (or leave None) when serving vanilla pi0/pi0.5.
      adv_ind: "positive"
      image_keys:
        cam_high: "head_camera-images-rgb"
        cam_left_wrist: "left_wrist_camera-images-rgb"
        cam_right_wrist: "right_wrist_camera-images-rgb"
      image_size: [224, 224]
      state_keys:
        - "left-joint_pos"
        - "left-gripper_pos"
        - "right-joint_pos"
        - "right-gripper_pos"
```

The `adv_ind_dropout=False` on the `_infer` TrainConfig only controls the **server-side** tokenizer's randomization; it doesn't conjure the field out of nowhere. The client still has to send it.

Launch deployment:

```bash
cd ~/limb
source .venv/bin/activate

# Pure autonomous serve (no recording)
uv run limb teleop --config-path configs/yam_dagger_pi0_bimanual.yaml

# Or: continue collecting rollouts with this checkpoint
uv run limb record --config-path \
    configs/yam_dagger_pi0_bimanual.yaml \
    configs/dagger_collection.yaml
```

## What's served vs what limb sends

| Wire field | limb sends | pistar server-side handling |
|---|---|---|
| `state` | YAM 14-D state (left + right joints + grippers) | padded to action_dim (32) |
| `images.cam_high` | resized head_camera (224×224, CHW) | tokenized by SigLIP |
| `images.cam_left_wrist` | resized left_wrist | tokenized by SigLIP |
| `images.cam_right_wrist` | resized right_wrist | tokenized by SigLIP |
| `prompt` | the task instruction string | concatenated with the `adv_ind` value into `"Task: , State: <…>, Advantage: ;\nAction: "` and tokenized by the PaliGemma SentencePiece tokenizer |
| `adv_ind` | **`"positive"`** (pinned by `OpenPIObsTransform.adv_ind`); required for pistar models | substituted into the `Advantage: …` clause above. `_infer` configs set `adv_ind_dropout=False`, so the clause is always present at inference. |

## Quantitative evaluation

For honest comparisons between checkpoints (Stage 2 SFT vs Stage 3 vs Stage 6), use the same prompt, same scene, and same number of trials.

```bash
# 1. Start the server with the candidate checkpoint
python scripts/serve_policy.py ...

# 2. Run N=50 trials, recording success/failure
uv run limb record --config-path \
    configs/yam_dagger_pi0_bimanual.yaml \
    configs/dagger_collection.yaml
# (the operator now just labels s/SPACE per attempt; don't intervene)

# 3. Compute success rate
uv run python -c "
import glob
# Count SUCCESS marker files that are empty, as in the original check.
markers = glob.glob('recordings/_/episode_*/SUCCESS')
ok = sum(open(f).read() == '' for f in markers)
total = len(glob.glob('recordings/_/episode_*'))
# Guard against an empty recordings directory (avoids ZeroDivisionError).
print(f'{ok}/{total} = {100 * ok / total:.1f}%' if total else 'no episodes found')
"
```

Key invariant: an evaluation run is **operator-passive**. No bilateral teleop and no CORRECTING phase; just observe the policy and label `s`/`SPACE`. If you intervene, the resulting episode no longer reflects the policy's autonomous performance.

## Common deployment issues

| Symptom | Likely cause |
|---|---|
| Robot moves identically to the pre-RECAP checkpoint | `_infer` config not used, so `adv_ind` is dropping out. Re-serve with the `_*_infer` variant. |
| First request takes 30+ seconds | Normal: pistar JIT-compiles the forward pass on first call. Subsequent requests are ~150 ms. |
| Connection refused on `:8111` | Server not up. Check `nvidia-smi`; if pistar crashed on load, see Stage 4/6 troubleshooting in their pages. |
| Server returns shape mismatch | Config / checkpoint mismatch. A LoRA-trained checkpoint with a full-fine-tune config (or vice versa) will fail to restore. |
| Robot jitters / oscillates | Action smoothing too tight. `action_horizon=50` + `smoothing_window=4` is the default; raise smoothing for jittery policies. |
| Policy stuck, won't initiate task | Initial state too far from training distribution. Use right-pedal → CORRECTING to nudge into a known pose, then left-pedal back. |

## Closed loop: feeding evaluation back into training

A successful evaluation run produces a new batch of episodes (s/SPACE-labeled). Treat that batch as a new iteration:

1. `limb convert-lerobot --pistar` ([Stage 1](stage1_conversion.md)) on the new session.
2. Merge into the existing v3.0 dataset (e.g. with `pistar/scripts/merge_datasets.py`).
3. Re-run `convert_v3_to_v21.py`.
4. Re-train Stage 4 → 5 → 6 on the bigger dataset.

This is the closed data loop that the pi0.6 paper iterates 2–3 times for hard tasks.

## Next

For the maintainer's reference, the upstream-pistar bugs we patched to make this all work end-to-end → [patches reference](patches.md).