

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

1 Zhejiang University · 2 Microsoft Research · 3 Independent Researcher

* Equal contribution · Work done during an internship at MSRA · Corresponding authors

World-R1 aligns text-to-video generation with 3D constraints through reinforcement learning, without changing the base architecture or adding inference-time 3D control modules.

Training Data

~3,000

Pure-text world-simulation prompts

Dynamic Subset

~500

High-entropy motion prompts

3D Consistency

27.67

Best PSNR from World-R1-Large

MVCS

0.993

Reconstruction-independent consistency

User Preference

86%

Overall win rate over Wan2.1

Backbones

1.3B / 14B

Wan2.1 variants trained on 48 / 96 H200 GPUs

01

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure-text dataset tailored for world simulation. With Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

02

Pipeline Overview

World-R1 pipeline
Text prompt -> implicit camera conditioning -> rollout video generation -> 3D-aware and general rewards -> Flow-GRPO-Fast update.

Training Flow

  • Parse text motion tokens into deterministic camera extrinsics, then project the trajectory into dense optical flow.
  • Inject the camera prior into the initial latent noise with discrete noise transport, without adding a control network.
  • Generate grouped rollout videos with stochastic Flow-GRPO-Fast sampling.
  • Lift generated videos into 3DGS with Depth Anything 3, then score meta-view plausibility, reconstruction fidelity, and trajectory alignment.
  • Combine the 3D-aware reward with HPSv3 visual-quality feedback, then periodically train on dynamic prompts using only the general reward; a minimal sketch of the group-relative update follows this list.
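
At the core of a GRPO-style update is a group-relative advantage: rewards for rollouts of the same prompt are normalized within the group, so no learned critic is needed. A minimal sketch, assuming one scalar combined reward per rollout; the function name and example values are illustrative, not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize scalar rewards within one prompt group (shape (G,)).

    Rollouts scoring above the group mean receive positive advantage;
    the per-group normalization replaces a learned value baseline.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts of one prompt with combined 3D-aware + general rewards.
combined = np.array([0.62, 0.71, 0.55, 0.80])
print(group_relative_advantages(combined))  # above-mean rollouts get positive advantage
```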

03

Core Components

Implicit Camera Conditioning

Prompt-specified push, pull, pan, move, and orbit motions are converted into camera trajectories and written into the initial noise through trajectory-guided warping.
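
A hedged sketch of this conditioning for the pan case: the motion token maps to a deterministic per-frame shift, and the initial latent noise is transported along it by integer-pixel rolling. `PAN_FLOW` and `warp_initial_noise` are illustrative names, and push, pull, and orbit (which need depth and full extrinsics) are omitted:

```python
import numpy as np

# Per-frame pixel shift (dx, dy) in latent pixels for simple pan motions.
PAN_FLOW = {
    "pan_right": np.array([1.0, 0.0]),
    "pan_left":  np.array([-1.0, 0.0]),
    "pan_up":    np.array([0.0, -1.0]),
    "pan_down":  np.array([0.0, 1.0]),
}

def warp_initial_noise(noise: np.ndarray, motion: str) -> np.ndarray:
    """Shift each frame's latent noise along the camera-induced flow.

    noise: (T, H, W) initial latent noise, single channel for clarity.
    Integer-pixel rolling permutes noise values rather than blending
    them, so each warped frame stays exactly Gaussian.
    """
    step = PAN_FLOW[motion]
    frames = []
    for t, frame in enumerate(noise):
        dx, dy = np.round(step * t).astype(int)
        frames.append(np.roll(frame, shift=(dy, dx), axis=(0, 1)))
    return np.stack(frames)

rng = np.random.default_rng(0)
noise = rng.standard_normal((16, 32, 32))            # 16 latent frames
conditioned = warp_initial_noise(noise, "pan_right")
```

Transporting noise as a permutation, rather than interpolating it, is what keeps the conditioned latent in-distribution for the frozen backbone.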

3D-Aware Reward

Depth Anything 3 reconstructs generated clips as 3DGS; Qwen3-VL evaluates meta-views, LPIPS measures re-rendering fidelity, and trajectory scores check camera control.
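
How these three signals might fold into one scalar is sketched below; the [0, 1] normalization of the inputs and the weights are assumptions, not published coefficients:

```python
def reward_3d(meta_view_score: float,
              lpips_distance: float,
              trajectory_error: float,
              w_meta: float = 0.4,
              w_fid: float = 0.4,
              w_traj: float = 0.2) -> float:
    """Weighted sum of the three 3D-aware signals, each assumed in [0, 1].

    In the real pipeline, meta_view_score would come from Qwen3-VL on
    re-rendered 3DGS views, lpips_distance from LPIPS, and
    trajectory_error from estimated vs. requested camera paths.
    """
    fidelity = 1.0 - lpips_distance       # LPIPS is a distance; invert it
    alignment = 1.0 - trajectory_error    # same for the trajectory error
    return w_meta * meta_view_score + w_fid * fidelity + w_traj * alignment

# Example: a plausible, well-reconstructed clip with slight camera drift.
print(reward_3d(meta_view_score=0.9, lpips_distance=0.16, trajectory_error=0.1))
```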

General Quality Reward

HPSv3 scores the generated frames so RL alignment improves geometry without sacrificing aesthetic quality, subject consistency, or motion smoothness.
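
A minimal sketch of combining the two reward streams, assuming per-frame HPSv3-style scores are averaged and added to the weighted 3D term; the weight `lam` is an assumption:

```python
from statistics import mean

def combined_reward(hps_frame_scores: list[float], r3d: float, lam: float = 1.0) -> float:
    """Average per-frame quality scores, then add the weighted 3D-aware reward."""
    return mean(hps_frame_scores) + lam * r3d

# Example: three frame-level quality scores plus the 3D reward from above.
print(combined_reward([0.71, 0.69, 0.73], r3d=0.84))
```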

Periodic Decoupled Training

Every 100 steps, the 3D-aware reward is temporarily disabled and the model is optimized on roughly 500 high-entropy dynamic prompts with the general reward only.
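
A sketch of that schedule; the 100-step period comes from the text, while treating each 100th step as a one-step dynamic phase and the subset names are simplifying assumptions:

```python
PERIOD = 100  # from the text: every 100 steps

def training_phase(step: int) -> tuple[str, bool]:
    """Return (prompt_subset, use_3d_reward) for an optimizer step."""
    if step > 0 and step % PERIOD == 0:
        # Dynamic phase: ~500 high-entropy prompts, general reward only.
        return "dynamic_subset", False
    return "world_sim_prompts", True  # combined 3D-aware + general reward

for step in (99, 100, 101, 200):
    print(step, training_phase(step))
```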

04

3D Reconstruction Diagnostics

High-fidelity reconstruction from World-R1
World-R1 produces dense and stable 3D reconstructions from generated videos.
Reconstruction failure from an inconsistent baseline video
Geometric hallucinations in baseline videos lead to sparse point clouds and unstable reconstructions.
Meta-view reward visualization
Meta-view evaluation exposes 3D failures that can appear plausible in the original view.

05

Dataset Taxonomy

Natural Landscapes

  • Landforms
  • Water Features
  • Weather & Time

Urban and Architectural

  • Urban Landscapes
  • Indoor Spaces
  • Infrastructure

Micro and Still Life

  • Desktop Still Life
  • Micro World
  • Material Representation

Fantasy and Surrealism

  • Non-Euclidean and physics-defying scenes

Artistic Styles

  • Stylized renderings beyond photorealism

Dynamic Data Subset

  • High-entropy scenes for periodic dynamic tuning

06

Quantitative Results

3D consistency table
Table 1. Reconstruction-based geometry consistency evaluation.
VBench results
Table 2. General video quality on VBench.
Ablation study
Ablation on reward components and training strategy.

3D Reconstruction

27.67 dB PSNR / 0.865 SSIM / 0.162 LPIPS

World-R1-Large substantially improves geometry consistency over Wan2.1-T2V-14B.

Small Variant

+10.23 dB PSNR

World-R1-Small reaches 27.63 dB PSNR and 0.858 SSIM on the 3D consistency benchmark.

VBench

65.74 aesthetic / 67.53 imaging

The RL-aligned model preserves general video quality while improving subject consistency to 97.58.

07

Human and Robustness Analyses

User Study

92% geometry / 76% control / 86% overall

25 participants compared World-R1 with Wan2.1 on 30 complex prompts in a blind two-alternative forced-choice (2AFC) setup.

Metric Validation

91.17% agreement

Human 3D-consistency preference aligns with the automatic metric ranking across 20 participants and 30 randomized pairs.

Long Video

121-frame generalization

World-R1-Large raises long-video PSNR from 18.32 (Wan2.1-T2V-14B backbone) to 26.32.

08

Baseline Comparison Videos

World-R1 vs baseline models on representative prompts.