Post-training Large Language Models (LLMs) for long-horizon agentic tasks (such as software engineering, web browsing, and complex tool use) presents a persistent trade-off between computational efficiency and model generalization. While Supervised Fine-Tuning (SFT) is computationally inexpensive, it frequently suffers from out-of-domain (OOD) performance degradation and struggles to generalize beyond its training distribution. Conversely, end-to-end reinforcement learning (E2E RL) typically preserves OOD capabilities and achieves high in-domain accuracy, but it incurs massive compute costs because every parameter update requires repeated, many-turn on-policy rollouts.
NVIDIA researchers have introduced PivotRL, a framework designed to bridge this gap. By operating on existing SFT trajectories, PivotRL aims to deliver the generalization benefits of E2E RL while maintaining the data efficiency associated with SFT.
The Architecture of a Pivot
The core of PivotRL is the transition from full-trajectory rollouts to targeted, turn-level updates. The framework relies on two primary mechanisms: Pivot Filtering and Functional Rewards.
1. Pivot Filtering
In turn-level agentic training, every assistant completion at a model-call boundary is considered an action. PivotRL begins by extracting all assistant turns from an SFT dataset into a ‘pivot candidate’ pool.
The system then profiles these candidates offline using a frozen reference policy, $\pi_0$. To optimize the training budget, PivotRL filters for pivots: specific states where local, on-policy rollouts exhibit high variance in outcomes. The filtering criteria are defined by two conditions:
- Nonzero empirical reward variance: $\hat{\sigma}^2(s) > 0$.
- Low reward mean: $\hat{\mu}(s) < \lambda_{\text{diff}}$, where $\lambda_{\text{diff}}$ is a difficulty threshold.
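The two conditions above can be sketched as a small offline filter. This is a minimal illustration, not the authors' implementation: the `PivotCandidate` container, the `select_pivots` function, and the threshold value 0.6 are all hypothetical, and the rewards are assumed to come from local rollouts already sampled from the frozen reference policy.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import List


@dataclass
class PivotCandidate:
    """One assistant turn extracted from an SFT trajectory (hypothetical container)."""
    turn_id: str
    rewards: List[float]  # rewards of local rollouts sampled from the reference policy


def select_pivots(candidates: List[PivotCandidate], mean_threshold: float) -> List[PivotCandidate]:
    """Keep turns with mixed outcomes (variance > 0) that are still hard (mean < threshold)."""
    pivots = []
    for cand in candidates:
        if not cand.rewards:
            continue  # nothing was profiled for this turn
        mu = mean(cand.rewards)
        var = pvariance(cand.rewards)
        if var > 0 and mu < mean_threshold:
            pivots.append(cand)
    return pivots


pool = [
    PivotCandidate("always_solved", [1.0, 1.0, 1.0, 1.0]),
    PivotCandidate("never_solved", [0.0, 0.0, 0.0, 0.0]),
    PivotCandidate("mostly_solved", [1.0, 1.0, 1.0, 0.0]),
    PivotCandidate("hard_but_mixed", [1.0, 0.0, 0.0, 0.0]),
]
# The threshold 0.6 is an illustrative assumption, not a value from the paper.
selected_ids = [c.turn_id for c in select_pivots(pool, mean_threshold=0.6)]
print(selected_ids)
```

In this toy pool only `hard_but_mixed` survives: the two uniform turns have zero variance, and `mostly_solved` has a mean reward above the assumed threshold.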
This approach addresses the uninformative-turn bottleneck. In group-normalized RL—specifically Group Relative Policy Optimization (GRPO)—turns where actions either uniformly succeed or uniformly fail result in a normalized advantage of zero, providing no meaningful gradient update. By focusing on mixed-outcome turns that remain difficult for the reference policy, PivotRL concentrates compute on states that provide the strongest learning signal.
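To see why uniform-outcome turns carry no signal, here is a minimal sketch of group-normalized advantages in the style of GRPO: each reward has the group mean subtracted and is divided by the group standard deviation. The function name and the small `eps` stabilizer are assumptions made for illustration.

```python
import math
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Subtract the group mean and divide by the group standard deviation (plus eps)."""
    n = len(rewards)
    mu = sum(rewards) / n
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (std + eps) for r in rewards]


uniform_success = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
uniform_failure = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 0.0])

print(uniform_success)  # every advantage is zero: no gradient signal
print(uniform_failure)  # every advantage is zero: no gradient signal
print(mixed)            # the successful sample is positive, the failures negative
```

Only the mixed group yields nonzero advantages, which is the case the pivot filter is designed to keep.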
2. Implementing Functional Rewards
Standard SFT-to-RL adaptations often rely on exact string matching with the demonstration data to assign rewards. However, in generative action spaces (e.g., shell commands or search queries), multiple functionally equivalent actions may diverge from the specific string in the training data.
PivotRL replaces strict matching with functional rewards, $r_{\text{func}}(s, a) = \mathbb{1}[a \in \mathcal{M}(s)]$, where $\mathcal{M}(s)$ is the set of locally acceptable actions determined by a domain-specific verifier. These verifiers can range from normalized schema checks and string similarity to lightweight LLM-as-a-judge scoring.
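The difference between exact matching and a functional reward can be illustrated with a toy verifier for shell commands. This is a sketch under stated assumptions: the verifier here only normalizes tokenization with Python's standard `shlex` module and checks membership in a hand-written acceptable set; the function names are hypothetical and the real verifiers described in the paper may be considerably richer.

```python
import shlex
from typing import Iterable, Set, Tuple


def normalize_command(command: str) -> Tuple[str, ...]:
    """Tokenize a shell command so whitespace and quoting differences are ignored."""
    return tuple(shlex.split(command))


def exact_match_reward(action: str, demonstration: str) -> float:
    """Baseline: reward only a character-for-character copy of the demonstration."""
    return 1.0 if action == demonstration else 0.0


def functional_reward(action: str, acceptable: Iterable[str]) -> float:
    """Return 1.0 if the action normalizes to any member of the acceptable set."""
    accepted: Set[Tuple[str, ...]] = {normalize_command(c) for c in acceptable}
    try:
        return 1.0 if normalize_command(action) in accepted else 0.0
    except ValueError:
        return 0.0  # unbalanced quotes: shlex cannot parse the action


demonstration = "grep -rn 'TODO' src"
acceptable_set = [demonstration, "grep -rn TODO src/"]  # illustrative acceptable actions
candidate = 'grep   -rn "TODO" src'  # same command, different spacing and quoting

print(exact_match_reward(candidate, demonstration))
print(functional_reward(candidate, acceptable_set))
print(functional_reward("rm -rf src", acceptable_set))
```

The candidate differs from the demonstration only in spacing and quoting, so exact matching rejects it while the functional reward accepts it.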
Theoretical Foundations: Gradient Signal and OOD Retention
The effectiveness of these design choices is supported by two primary theoretical results:
- Theorem 3.2 (Reward Variance and GRPO Signal): The research team proved that the Fisher norm of the natural gradient of the statewise reward objective scales with the reward standard deviation, a result the paper states in terms of the population GRPO score. This validates the strategy of filtering for mixed-outcome pivots to maximize the local in-domain learning signal.
- Theorem 3.3 (Minimal KL Change): This theorem demonstrates that functional reward-based RL shifts probability mass toward acceptable actions while preserving the reference policy’s relative probability ordering for actions unrelated to the training task. Because the relative ranking of task-unrelated actions remains unchanged, PivotRL significantly mitigates the catastrophic forgetting and OOD degradation common in SFT.
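The ordering-preservation claim in Theorem 3.3 can be illustrated numerically. The sketch below assumes the textbook closed form for KL-regularized reward maximization, in which the updated policy is proportional to the reference policy times exp(reward / beta); this form, the function name, the action labels, and the value of `beta` are assumptions for illustration and may not match the paper's exact statement.

```python
import math
from typing import Dict, Set


def tilt_toward_acceptable(ref_probs: Dict[str, float], acceptable: Set[str], beta: float) -> Dict[str, float]:
    """Reweight pi_0 by exp(reward / beta) with a 0/1 functional reward, then renormalize."""
    weights = {
        action: prob * math.exp((1.0 if action in acceptable else 0.0) / beta)
        for action, prob in ref_probs.items()
    }
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}


# Hypothetical reference distribution over four actions at one state.
ref = {"good_cmd_a": 0.10, "good_cmd_b": 0.05, "other_x": 0.60, "other_y": 0.25}
new = tilt_toward_acceptable(ref, acceptable={"good_cmd_a", "good_cmd_b"}, beta=0.5)

mass_before = ref["good_cmd_a"] + ref["good_cmd_b"]
mass_after = new["good_cmd_a"] + new["good_cmd_b"]
ratio_before = ref["other_x"] / ref["other_y"]
ratio_after = new["other_x"] / new["other_y"]

print(mass_before, mass_after)    # probability mass on acceptable actions increases
print(ratio_before, ratio_after)  # relative odds of the other actions are unchanged
```

Under this assumed update, mass moves toward the acceptable set while the ratio between any two actions outside it stays the same, which is the intuition behind the reduced OOD degradation.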
Performance and Efficiency
The research team evaluated PivotRL using Qwen3-30B-A3B-Thinking-2507 as the base model across four agentic domains: conversational tool use (τ²-Bench), software engineering (SWE-Bench Verified), terminal control (Terminal-Bench), and web browsing (BrowseComp).
In-Domain Accuracy Gains
Compared to SFT on identical data, PivotRL achieved superior in-domain results:
- Average Gain: +14.11 points over the base model, compared to +9.94 points for SFT.
- Domain Specifics: PivotRL outperformed SFT on τ²-Bench (+5.37), Terminal-Bench (+6.25), and BrowseComp (+9.80).
Out-of-Domain Retention
The most significant advantage was observed in OOD stability. While SFT caused an average regression of -9.83 points across eight OOD benchmarks (including math and science QA), PivotRL maintained a near-zero average change of +0.21. That is a gap of 10.04 points in OOD accuracy on non-agentic tasks in PivotRL's favor.
Compute Efficiency on SWE-Bench
On SWE-Bench Verified, a rigorous standard for long-horizon agents, PivotRL demonstrated a substantial reduction in training overhead:
- Turn Efficiency: PivotRL reached accuracy levels comparable to E2E RL using 4x fewer rollout turns.
- Temporal Efficiency: Training was ~5.5x faster in wall-clock time than E2E RL when using the same number of compute nodes.
Key Takeaways
- Hybrid Efficiency: PivotRL combines the compute efficiency of Supervised Fine-Tuning (SFT) with the out-of-domain (OOD) generalization of End-to-End RL.
- Pivot Filtering: The framework identifies ‘pivots’—critical intermediate turns where sampled actions show high variance in success/failure, providing the strongest learning signals.
- Functional Verifiers: Instead of requiring exact text matches, PivotRL uses domain-specific verifiers to reward any functionally equivalent action.
- OOD Stability: Unlike SFT, PivotRL preserves the model’s performance on unrelated tasks (e.g., math) by maintaining the reference policy’s probability ordering for task-unrelated actions.
- Production Speed: It achieves accuracy comparable to E2E RL with 4x fewer rollout turns and ~5.5x faster training time, and it has been used in NVIDIA’s Nemotron-3-Super.