
One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

Sep 30, 2025
Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu
Type: Conference paper
Publication: arXiv:2509.26313 (2025)
Last updated on Sep 30, 2025
Tags: Large Language Models
Authors: Haoyuan Wu, Ph.D. Student
