Polar - NVIDIA's open-source agent reinforcement learning training framework - AiBoss

What is Polar?

Polar is an open-source agent reinforcement learning (ARL) training framework from NVIDIA. Its core innovation is that it can train existing agent frameworks with RL algorithms such as GRPO without modifying their internal code. By sitting at the boundary of the agent's LLM API calls, the framework captures token-level interaction data and reconstructs training trajectories, so that complex coding-agent harnesses such as Codex CLI, Claude Code, Qwen Code, and Pi can serve directly as trainable RL environments.

Polar's main functions

API proxy capture: An API proxy compatible with Anthropic-, OpenAI-, and Google-style APIs sits between the agent and the inference server, transparently forwarding requests while logging prompts, sampled tokens, log probabilities, and responses.
Trajectory reconstruction: Two strategies, per-request and prefix merging, reconstruct multi-turn model calls into RL trajectories that the trainer can consume directly.
Asynchronous service architecture: The Rollout Server handles task scheduling and load balancing, while Gateway Nodes handle runtime warm-up, agent execution, trajectory construction, and evaluation, which decouples training from execution.
Multi-harness compatibility: Quick adapters are built in for mainstream coding agents such as Claude Code, Codex, Qwen Code, OpenCode, Pi, and Gemini CLI.
Containerized runtime: Docker and rootless Apptainer are supported, providing an isolated execution environment.
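
To make the capture step and the per-request strategy more concrete, here is a minimal in-process sketch. It is not Polar's actual API: `CapturedCall`, `RecordingProxy`, `per_request_samples`, and `fake_backend` are hypothetical names, and the real proxy forwards HTTP requests rather than calling a Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical record of one LLM call as seen at the proxy (not Polar's schema).
@dataclass
class CapturedCall:
    prompt_ids: List[int]
    sampled_ids: List[int]
    logprobs: List[float]


# A backend is any callable returning (sampled token IDs, per-token log probabilities).
Backend = Callable[[List[int]], Tuple[List[int], List[float]]]


@dataclass
class RecordingProxy:
    """Forwards each request unchanged and logs what the backend sampled."""
    backend: Backend
    log: List[CapturedCall] = field(default_factory=list)

    def complete(self, prompt_ids: List[int]) -> List[int]:
        sampled, logprobs = self.backend(prompt_ids)
        self.log.append(CapturedCall(list(prompt_ids), list(sampled), list(logprobs)))
        return sampled  # the agent sees only the normal response


def per_request_samples(log: List[CapturedCall]) -> List[Dict[str, list]]:
    """Per-request strategy: every captured call becomes its own training sample."""
    samples = []
    for call in log:
        samples.append({
            "input_ids": call.prompt_ids + call.sampled_ids,
            # Only sampled tokens contribute to the loss; prompt tokens are masked out.
            "loss_mask": [0] * len(call.prompt_ids) + [1] * len(call.sampled_ids),
            "logprobs": call.logprobs,
        })
    return samples


def fake_backend(prompt_ids: List[int]) -> Tuple[List[int], List[float]]:
    # Stand-in for an inference server: always emits the same two tokens.
    return [100, 101], [-0.1, -0.2]
```

Because the token IDs and log probabilities are stored exactly as the backend returned them, nothing in this sketch ever re-tokenizes text.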

Polar's technical principles

Black-box agent paradigm: Polar does not rewrite the agent harness as an env.init()/env.step() interface. Instead, it uses LLM API traffic as the rollout boundary, keeping the native execution logic of the harness unchanged.
Token-faithful trajectory reconstruction: Token IDs and log probabilities are obtained directly from the inference backend, which avoids retokenization drift and keeps the training signal strictly aligned with the behavior policy.
Prefix merging algorithm: Polar detects token-prefix relationships between prompts in multi-turn dialogues and merges append-only dialogue chains into longer training trajectories, reducing the number of trainer updates.
Asynchronous phased execution: The Gateway internally separates three independent worker pools: INIT (runtime startup), RUN (harness execution), and POSTRUN (trajectory construction and evaluation). Together with the READY buffer, these allow runtime warm-up and GPU training to run in parallel.
Weight synchronization mechanism: Model weights are synchronized asynchronously between the Trainer and the Inference Server. Rollouts keep sampling from the old policy, and the trainer performs a policy update once it has received enough trajectories.
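
The prefix merging idea can be sketched in a few lines. This is an illustrative reading of the description above, not Polar's implementation: the function name `merge_by_prefix`, the tuple layout, and the choice of `0.0` as a placeholder log probability for non-sampled tokens are all assumptions.

```python
from typing import Dict, List, Tuple

# One captured call: (prompt token IDs, sampled token IDs, sampled-token log probabilities).
Call = Tuple[List[int], List[int], List[float]]


def merge_by_prefix(calls: List[Call]) -> List[Dict[str, list]]:
    """Merge consecutive calls whose prompt extends the tokens seen so far."""
    trajectories: List[Dict[str, list]] = []
    current = None
    for prompt_ids, sampled_ids, logprobs in calls:
        seen = current["input_ids"] if current else None
        if seen is not None and prompt_ids[:len(seen)] == seen:
            # Append-only chain: keep only the tokens added since the last call
            # (for example tool output or a new user turn).
            new_context = prompt_ids[len(seen):]
        else:
            # The prompt diverged (or this is the first call): start a new trajectory.
            current = {"input_ids": [], "loss_mask": [], "logprobs": []}
            trajectories.append(current)
            new_context = prompt_ids
        current["input_ids"] += new_context + sampled_ids
        # Context tokens are masked out; only sampled tokens carry a training signal.
        current["loss_mask"] += [0] * len(new_context) + [1] * len(sampled_ids)
        current["logprobs"] += [0.0] * len(new_context) + logprobs
    return trajectories
```

For example, three calls in which the second prompt extends the first call's prompt plus response, while the third starts from an unrelated prompt, collapse into two trajectories instead of three training samples.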

How to use Polar

Deploy the Polar service: Start the Rollout Server and Gateway Nodes, and configure the Inference Server (such as SGLang).
Configure the harness: Point the model base URL of the target agent (such as Codex CLI) to the Polar Gateway proxy endpoint.
Write an adapter: Create a harness adapter (this usually only requires configuring environment variables, provider settings, and startup commands).
Submit a training task: Submit a TaskRequest via the Polar API, specifying the harness, runtime, evaluator, and trajectory building strategy.
Connect the trainer: Training frameworks (such as Slime and Megatron) receive the trajectory data returned by Polar via callbacks and run RL algorithms such as GRPO to update the model.
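
The adapter and task submission steps above might look roughly like the following. Everything here is a placeholder for illustration: the field names, the `MODEL_BASE_URL` variable, the gateway URL, the harness command, and the allowed strategy and runtime strings are assumptions, not Polar's documented schema. Consult the repository for the real `TaskRequest` format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

GATEWAY_URL = "http://localhost:8000/v1"  # hypothetical Gateway proxy endpoint


@dataclass
class HarnessAdapter:
    """Hypothetical adapter: environment variables, provider setting, startup command."""
    name: str
    provider: str
    env: Dict[str, str]
    command: List[str]


@dataclass
class TaskRequest:
    """Hypothetical request carrying the four choices named in the steps above."""
    harness: str
    runtime: str
    evaluator: str
    trajectory_strategy: str

    def validate(self) -> None:
        # The allowed values below are assumed spellings, not confirmed identifiers.
        if self.runtime not in {"docker", "apptainer"}:
            raise ValueError(f"unknown runtime: {self.runtime}")
        if self.trajectory_strategy not in {"per_request", "prefix_merging"}:
            raise ValueError(f"unknown strategy: {self.trajectory_strategy}")


# The adapter redirects the harness's model traffic to the gateway proxy.
adapter = HarnessAdapter(
    name="example-harness",
    provider="openai-compatible",
    env={"MODEL_BASE_URL": GATEWAY_URL},  # placeholder variable name
    command=["example-agent", "--non-interactive"],  # placeholder command
)

request = TaskRequest(
    harness=adapter.name,
    runtime="docker",
    evaluator="unit-tests",
    trajectory_strategy="prefix_merging",
)
request.validate()
payload = json.dumps(asdict(request), sort_keys=True)  # body to send to the Polar API
```

The sketch stops at building the payload. How it is actually submitted, and how the trainer registers its callback, depends on the Polar API itself.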

Polar's core advantages

Zero-intrusion integration: There is no need to modify the source code of an existing agent framework, which lowers the technical barrier to RL training.
Harness-agnostic: It is compatible with any agent built on an LLM API, including closed-source binaries.
Efficient resource utilization: The asynchronous architecture lets CPU-intensive runtime preparation proceed without blocking GPU training, and prefix merging reduces training time by a factor of approximately 5.39.
Token-level fidelity: The original tokens are captured directly from the inference backend, avoiding the training signal distortion caused by re-encoding text.
Elastic scaling: The rollout-as-a-service design supports large-scale distributed asynchronous RL training.
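
The overlap behind the resource utilization claim can be illustrated with a small asyncio pipeline. This is a toy model of the INIT, READY, RUN, and POSTRUN stages described earlier, not Polar's code: the pool sizes, the buffer capacity, the sleep durations, and all function names are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Task = Dict[str, object]
Work = Callable[[Task], Awaitable[Task]]


def start_pool(size: int, inbox: asyncio.Queue, outbox: asyncio.Queue,
               work: Work) -> List[asyncio.Task]:
    """Start `size` workers that move items from inbox to outbox via `work`."""
    async def worker() -> None:
        while True:
            item = await inbox.get()
            try:
                await outbox.put(await work(item))
            finally:
                inbox.task_done()
    return [asyncio.create_task(worker()) for _ in range(size)]


async def init_runtime(task: Task) -> Task:
    await asyncio.sleep(0.01)  # stands in for container startup
    return {**task, "stages": ["INIT"]}


async def run_harness(task: Task) -> Task:
    await asyncio.sleep(0.01)  # stands in for the agent executing
    return {**task, "stages": task["stages"] + ["RUN"]}


async def build_and_score(task: Task) -> Task:
    return {**task, "stages": task["stages"] + ["POSTRUN"]}


async def main(n_tasks: int = 4) -> List[Task]:
    init_q: asyncio.Queue = asyncio.Queue()
    ready_q: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded READY buffer
    postrun_q: asyncio.Queue = asyncio.Queue()
    done_q: asyncio.Queue = asyncio.Queue()
    workers = (start_pool(2, init_q, ready_q, init_runtime)
               + start_pool(2, ready_q, postrun_q, run_harness)
               + start_pool(1, postrun_q, done_q, build_and_score))
    for i in range(n_tasks):
        init_q.put_nowait({"id": i})
    # Each join returns once every item in that stage has been handed onward.
    await init_q.join()
    await ready_q.join()
    await postrun_q.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    results = [done_q.get_nowait() for _ in range(n_tasks)]
    return sorted(results, key=lambda t: t["id"])
```

Running `asyncio.run(main())` pushes four tasks through all three stages. While the RUN workers are busy with early tasks, the INIT workers are already warming up later ones, and the bounded READY buffer keeps warm-up from running arbitrarily far ahead.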

Polar's project address

GitHub repository: https://github.com/NVIDIA-NeMo/ProRL-Agent-Server
arXiv technical paper: https://arxiv.org/pdf/2605.24220

Polar compared with competing products

Dimension	Polar (NVIDIA)	SkyRL-Agent	Agent Lightning
Core positioning	Rollout-as-a-Service infrastructure	Full-stack multi-turn agent RL training and evaluation system	Decoupled training-agent architecture + unified data interface
Integration intrusiveness	Zero intrusion: API proxy interception, with no need to modify harness source code.	Requires rewriting: the agent must be adapted to a Gymnasium-style interface.	Low intrusion: requires integration with a standard tracing API or SDK callbacks.
Harness compatibility	Arbitrary black-box harnesses (including closed-source binaries)	Only agents implemented within the framework	Agents that conform to the preset interface
Rollout boundary	LLM API traffic boundary	Inside the agent's execution logic	Agent execution tracing layer
Asynchronous architecture	Native asynchronous service boundary (Server + Gateway Nodes)	Asynchronous operation is supported, but the agent is tightly coupled with the training process.	Limited asynchronous support
Trajectory reconstruction	Token fidelity guarantee + prefix merging (reduces trainer updates)	Trajectories generated directly within the framework	Conversion through a unified data interface
Runtime isolation	Docker / Apptainer	Containerization supported	Unclear
Training algorithm coupling	Algorithm-agnostic (GRPO, PPO, and others can be plugged in)	Built-in algorithm optimization	Algorithm-agnostic
Representative scenarios	RL training with off-the-shelf harnesses such as Codex, Claude Code, and Qwen Code.	Long-horizon, multi-turn tool-use agent training	Cross-framework collection of agent training data

Applications of Polar

Reinforcement learning for coding agents: Apply RL fine-tuning to programming assistants such as Codex and Claude Code to improve performance on software engineering benchmarks such as SWE-Bench.
Multi-turn tool-use agent training: Train long-running agents that must continuously call external tools (browsers, databases, APIs).
Offline SFT data generation: Use Polar to generate high-quality training data in batches on a custom harness for supervised fine-tuning.
Multi-agent cooperative optimization: End-to-end RL training for complex multi-agent systems involving sub-agent orchestration and context compression.
Closed-source agent evaluation and improvement: Black-box RL training and capability enhancement for closed-source agent products whose source code is unavailable.