Scaling Self-Play for End-to-End Driving

Video Overview

Method

Self-Play for End-to-End Driving

Three-stage self-play training pipeline: vectorized teacher, pixel-based student via self-play DAgger, and sim-to-real perception adaptation.

**Overview.** **(a) Vectorized teacher.** A compact policy is trained with self-play reinforcement learning over fast vectorized BEV observations, yielding a robust, naturalistic driving policy. **(b) Pixel-based student (self-play DAgger).** The teacher is distilled into a pixel-based end-to-end policy inside Gigapixel. Every agent in the scene is controlled by the student, and at each visited state a *forked* parallel simulator rolls out the teacher to generate per-agent trajectory targets. **(c) Sim-to-real perception adaptation.** To deploy on real sensor observations, we freeze the planning head and finetune only the perception backbone on paired simulated–real observations, mapping real images into the latent representation the planning head already acts on.
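The data flow in stage (b) can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the paper's implementation: a 1-D world stands in for Gigapixel, a scalar goal offset stands in for pixels, a proportional controller stands in for the RL teacher, and the student is a single gain fit by least squares. All names (`ToySim`, `teacher_targets`, `self_play_dagger`) are hypothetical.

```python
import copy
import random


class ToySim:
    """Toy 1-D world (assumption): every agent has a position and a goal."""

    def __init__(self, n_agents, seed=0):
        rng = random.Random(seed)
        self.pos = [rng.uniform(-1, 1) for _ in range(n_agents)]
        self.goal = [rng.uniform(-1, 1) for _ in range(n_agents)]

    def observe(self, i):
        # Stand-in for an ego-centric observation: signed offset to the goal.
        return self.goal[i] - self.pos[i]

    def step(self, actions):
        # Actions are displacements, clipped to a maximum step size.
        for i, a in enumerate(actions):
            self.pos[i] += max(-0.1, min(0.1, a))


def teacher_action(sim, i):
    # Stand-in teacher: proportional controller on privileged state.
    return 0.5 * (sim.goal[i] - sim.pos[i])


def teacher_targets(sim, horizon):
    """Fork the simulator and roll the teacher out for `horizon` steps.

    Returns one trajectory per agent, as offsets from the agent's current
    position. The original `sim` is left untouched.
    """
    fork = copy.deepcopy(sim)
    n = len(fork.pos)
    trajectories = [[] for _ in range(n)]
    for _ in range(horizon):
        fork.step([teacher_action(fork, i) for i in range(n)])
        for i in range(n):
            trajectories[i].append(fork.pos[i] - sim.pos[i])
    return trajectories


def collect(student_gain, n_agents=4, steps=20, horizon=3, seed=0):
    """Roll out the student for ALL agents; label each visited state with the teacher."""
    sim = ToySim(n_agents, seed)
    data = []
    for _ in range(steps):
        targets = teacher_targets(sim, horizon)
        for i in range(n_agents):
            data.append((sim.observe(i), targets[i]))  # every agent yields a sample
        sim.step([student_gain * sim.observe(i) for i in range(n_agents)])
    return data


def fit_gain(data):
    # Least-squares fit of the first teacher waypoint against the observation.
    num = sum(obs * traj[0] for obs, traj in data)
    den = sum(obs * obs for obs, _ in data)
    return num / den if den else 0.0


def self_play_dagger(iterations=5):
    gain, dataset = 0.0, []
    for it in range(iterations):
        dataset += collect(gain, seed=it)  # aggregate data from student-visited states
        gain = fit_gain(dataset)           # refit the student on the aggregated dataset
    return gain, dataset
```

The point of the sketch is the structure of the loop: the student's own actions determine which states are visited, the teacher only supplies labels from a forked copy of the state, and each rollout step produces one labeled sample per agent.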

Why self-play? Behavior cloning learns from a fixed, narrow distribution of human logs and never sees the consequences of its own actions, so errors compound in closed-loop driving. In self-play, agents are paired with copies of themselves and learn through closed-loop interaction. This surfaces safety-critical interactions that are vanishingly rare in driving logs and exposes the policy to the consequences of its own actions during training.

Simulator

The Gigapixel Renderer

Gigapixel extends the PufferDrive batched simulator with the GPU-accelerated Madrona renderer, exposing ego-centric perspective views rather than only vectorized BEV features. It renders a deliberately simple bounding-box world: vehicles and static objects as cuboids, lane polylines as thin planar strips, and traffic lights as small spheres. This preserves the scene geometry and interaction fidelity needed for planning while sustaining 50k agent steps per second on a single GPU, with throughput scaling near-linearly with the number of GPUs.
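As a back-of-the-envelope aid, the sketch below converts the quoted single-GPU throughput into a wall-clock estimate for a given step budget. The function name `training_hours` and the `scaling_efficiency` parameter are illustrative assumptions; the text above only states that scaling is near-linear, so the efficiency should be measured rather than taken from this example.

```python
def training_hours(agent_steps, n_gpus=1, sps_per_gpu=50_000, scaling_efficiency=1.0):
    """Estimate wall-clock hours needed to simulate `agent_steps` agent steps.

    sps_per_gpu: single-GPU agent steps per second (50k is the figure quoted above).
    scaling_efficiency: assumed fraction of ideal linear multi-GPU scaling
        (hypothetical knob, 1.0 means perfectly linear).
    """
    if n_gpus < 1 or not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("need n_gpus >= 1 and 0 < scaling_efficiency <= 1")
    effective_sps = sps_per_gpu * n_gpus * scaling_efficiency
    return agent_steps / effective_sps / 3600.0


# Example: a 1B-step budget on 8 GPUs, assuming 90% scaling efficiency.
hours = training_hours(1_000_000_000, n_gpus=8, scaling_efficiency=0.9)
print(f"{hours:.2f} hours")
```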

Gigapixel Rollouts

The ego-centric pixel observations a policy receives during self-play, stitched across the forward camera views. Blue cuboids are surrounding agents, thin strips are lane polylines, and the small colored spheres are traffic lights.

Comparison of Gigapixel ray-traced and rasterized renderings of three scenes across resolutions from 64x64 to 512x512.

Two Rendering Backends

Gigapixel supports rasterized and ray-traced rendering. The rasterizer is faster, exploiting the simplicity of the primitives; the ray tracer trades throughput for higher visual fidelity, which is visible in the richer shading of the lower rows.

Agent steps-per-second versus rendering resolution for Gigapixel rasterizer and ray tracer compared to HUGSIM and RAP renderers.

Throughput vs. Resolution

Agent steps per second (SPS) across rendering resolutions and policy architectures on one NVIDIA A100 GPU. Render Only isolates renderer throughput, with no policy forward or backward pass.

The Gigapixel rasterizer is ~1000× faster than the Gaussian-splatting HUGSIM renderer and ~4000× faster than the CPU rasterizer RAP at 512×512.
The gap between the rasterizer (Rast.) and ray tracer (RT) narrows as model complexity grows (CNN → DrivoR). This confirms that with a heavy end-to-end model, the renderer is no longer the throughput bottleneck.
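One way to see why the gap narrows is a simple serial cost model, sketched below. It assumes rendering and the policy pass run back to back for every step with no overlap, which may not match the actual pipeline. The per-step timings are hypothetical values chosen for illustration, not measurements from the paper.

```python
def steps_per_second(render_ms, model_ms):
    """Throughput when rendering and the policy pass run back to back for each step."""
    return 1000.0 / (render_ms + model_ms)


def raster_speedup(raster_ms, raytrace_ms, model_ms):
    """How many times faster the rasterized pipeline is than the ray-traced one."""
    return steps_per_second(raster_ms, model_ms) / steps_per_second(raytrace_ms, model_ms)


# Hypothetical per-step costs in milliseconds (NOT measurements from the paper).
RASTER_MS, RAYTRACE_MS = 0.02, 0.10

for name, model_ms in [("render only", 0.0), ("light model", 0.2), ("heavy model", 5.0)]:
    speedup = raster_speedup(RASTER_MS, RAYTRACE_MS, model_ms)
    print(f"{name:12s} rasterizer speedup = {speedup:.2f}x")
```

Under this model the speedup equals `(raytrace_ms + model_ms) / (raster_ms + model_ms)`, which tends to 1 as the model cost grows: once the policy dominates each step, the choice of renderer has little effect on overall throughput.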

Results

Self-Play vs. Behavior Cloning

We compare two pixel-based DrivoR policies driving closed-loop in photorealistic reconstructed real-world scenes. Both share the same architecture; only the training signal differs.

Legend: Self-Play Trained (trained in Gigapixel; adapted to real observations) · BC Trained (human logs) · Planned Trajectory

Scenarios 1 to 5: the same scene driven by both policies, shown side by side (Self-Play Trained and BC Trained).

What to look for. The self-play policy anticipates hazards: it reduces speed and plans smooth corrective maneuvers (e.g., yielding to a decelerating lead vehicle or steering back from the road edge). The behavior-cloned policy, never exposed to the consequences of its own actions, tends to keep a centered, high-velocity straight-ahead plan and fails in the rare safety-critical situations that precede a stop or a recovery (rear-ending a stopped vehicle or drifting off-road). On HUGSIM, the average collision velocity of the self-play policy is 1.95 m/s versus 5.27 m/s for behavior cloning, a 2.7× reduction.

Gigapixel Driving Score versus global step for self-play DAgger versus self-play RL, with the vectorized teacher as an upper reference.

Self-Play DAgger is More Sample-Efficient than Self-Play RL

Using a lightweight CNN policy (for tractable RL experimentation), we compare distilling a privileged teacher via self-play DAgger against training the pixel policy directly with self-play RL.

Self-play DAgger surpasses a Gigapixel Driving Score of 60 in roughly 3000× fewer steps than self-play RL.
It quickly approaches the vectorized teacher's performance (dashed line, reached at 25B steps).
This motivates distilling an RL teacher rather than training a large pixel policy with RL from scratch.
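A "fewer steps to reach a score" comparison of this kind can be computed from logged learning curves as sketched below. The helper names and the curves are hypothetical; the numbers are synthetic and are not the paper's measurements.

```python
def steps_to_reach(curve, threshold):
    """Return the first logged step whose score meets the threshold, else None.

    curve: list of (global_step, score) pairs sorted by step.
    """
    for step, score in curve:
        if score >= threshold:
            return step
    return None


def sample_efficiency_ratio(fast_curve, slow_curve, threshold):
    """Ratio of steps the slower method needs to the steps the faster one needs."""
    fast = steps_to_reach(fast_curve, threshold)
    slow = steps_to_reach(slow_curve, threshold)
    if fast is None or slow is None:
        return None  # a method never reached the threshold in the logged range
    return slow / fast


# Synthetic curves for illustration only (NOT the paper's data).
dagger_curve = [(1_000, 20.0), (10_000, 45.0), (100_000, 62.0), (1_000_000, 70.0)]
rl_curve = [(1_000_000, 10.0), (10_000_000, 40.0), (100_000_000, 61.0)]

ratio = sample_efficiency_ratio(dagger_curve, rl_curve, threshold=60.0)
print(ratio)
```

Because the crossing point is read from logged checkpoints, the resulting ratio is only as precise as the logging interval.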

Gigapixel Driving Score versus global step comparing self-play DAgger, single-agent DAgger, and behavior cloning.

Scaling Self-Play

Closed-loop performance of the DrivoR-Reg student as training experience scales in Gigapixel, across three end-to-end training strategies.

Self-play DAgger improves consistently with scale and overtakes both single-agent DAgger and behavior cloning beyond 10M steps.
Behavior cloning plateaus around 100M steps, as the student is never exposed to the consequences of its own actions.
Self-play's edge comes from two effects: every agent in a rollout contributes supervised data, and the co-evolving interactions span more diverse, safety-critical states.

Headline results. Gigapixel-DrivoR reaches state-of-the-art performance on the closed-loop HUGSIM benchmark (38.5 HD-Score, 50.1 route completion) and competitive performance on NAVSIM-v2, without human trajectory supervision. Scaling self-play yields proportional gains in policy performance.

Citation

BibTeX

If you find this work useful, please consider citing it.

@article{rowe2026gigapixel,
  title   = {Scaling Self-Play for End-to-End Driving},
  author  = {Rowe, Luke and Girgis, Roger and de Schaetzen, Rodrigue and
             Cornelisse, Daphne and Grandhi, Alaap and Heide, Felix and
             Vinitsky, Eugene and Pal, Christopher and Paull, Liam},
  journal = {arXiv preprint arXiv:2606.19641},
  year    = {2026}
}