Strict On-Policy Training with Optimal Baseline: Microsoft Introduces Simplified Algorithm for RLHF
4 June 2025
A Microsoft Research team has introduced On-Policy RL with Optimal reward baseline (OPO), a simplified reinforcement learning algorithm for aligning large language models. The new method addresses key problems of…