AI Training · 18 min read · June 16, 2026
Verifier-Calibrated On-Policy Distillation
A concrete post-training algorithm that combines on-policy sampling, verifier rewards, teacher logits, clipping, and replay so models can learn new capabilities without catastrophic forgetting.
Train on student-generated states instead of relying only on fixed datasets or teacher rollouts.
Use verifier rewards to decide which trajectories deserve dense teacher guidance.
Clip teacher-token influence and replay older skills to reduce forgetting.
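
The three points above can be sketched as one scalar objective. This is a minimal illustration, not the article's exact algorithm. Every name here (`clipped_token_signal`, `trajectory_loss`, `total_loss`, `threshold`, `clip`, `replay_weight`) is hypothetical, and the gating direction is an assumption: trajectories whose verifier reward falls below `threshold` receive teacher guidance, and the rest receive none. The per-token signal is the log-ratio between student and teacher on tokens the student itself sampled, which is a common single-sample estimate of reverse KL.

```python
def clipped_token_signal(student_logp, teacher_logp, clip=2.0):
    """Log-ratio log p_student - log p_teacher on a student-sampled token,
    clipped to [-clip, clip] so no single token dominates the update."""
    gap = student_logp - teacher_logp
    return max(-clip, min(clip, gap))


def trajectory_loss(student_logps, teacher_logps, reward, threshold=0.5, clip=2.0):
    """Mean clipped teacher signal over one student-generated trajectory.

    Assumption: the verifier gate is a hard switch that turns guidance on
    only when the trajectory's reward is below `threshold`.
    """
    gate = 1.0 if reward < threshold else 0.0
    signals = [
        clipped_token_signal(s, t, clip)
        for s, t in zip(student_logps, teacher_logps)
    ]
    return gate * sum(signals) / len(signals)


def total_loss(batch, replay_loss, replay_weight=0.1, threshold=0.5, clip=2.0):
    """Average the gated distillation term over a batch of trajectories and
    add a weighted loss computed on replayed examples of older skills.

    `batch` is a list of (student_logps, teacher_logps, reward) tuples.
    `replay_loss` is assumed to be computed elsewhere (e.g. cross-entropy
    on a replay buffer) and passed in as a plain float.
    """
    distill = sum(
        trajectory_loss(s, t, r, threshold, clip) for s, t, r in batch
    ) / len(batch)
    return distill + replay_weight * replay_loss


# Two student rollouts: the first fails the verifier, the second passes.
failed = ([-1.0, -0.5], [-0.5, -4.0], 0.0)
passed = ([-1.0, -0.5], [-0.5, -4.0], 1.0)
loss = total_loss([failed, passed], replay_loss=2.0)
```

In the failed rollout, the second token's raw gap of 3.5 is clipped to 2.0, so the student is pulled toward the teacher without one surprising token swamping the gradient. A real implementation would compute the same quantities on tensors inside an autodiff framework; plain floats are used here only to keep the arithmetic visible.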