One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Sep 30, 2025
Rui Ming
Haoyuan Wu
Shoubo Hu
Zhuolun He
Bei Yu