Latent Spatial Memory for Video World Models

Weijie Wang1*, Haoyu Zhao1*, Yifan Yang2, Feng Chen3, Zeyu Zhang1, Yefei He1, Zicheng Duan3, Donny Y. Chen4, Yuqing Yang2, Bohan Zhuang1

1 Zhejiang University · 2 Microsoft Research · 3 Adelaide University · 4 Monash University
* Equal contribution

10.57× faster generation · 55× lower 3D cache memory · 70.36 WorldScore average

Promo Video

A short preview of video world modeling with persistent latent spatial memory.

Latent Spatial Memory

Mirage stores static scene content as 3D latent tokens, then reads and updates that cache directly during generation.

Latent spatial memory keeps persistent 3D scene context directly in latent space, avoiding the RGB render-and-reencode loop used by point-cloud memory.

Mirage Architecture

Mirage initializes, reads, and updates a persistent latent spatial memory.

Mirage architecture initializes a latent cache from the first frame, reads it for each target view, and writes updated static content back across generated chunks.
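
The initialize, read, and update cycle described above can be illustrated with a toy sketch. This is not Mirage's implementation: the `Camera` and `LatentSpatialMemory` classes, the depth-based unprojection, the voxel-hash storage, and the z-buffer readout are all hypothetical choices made only to show how a cache of 3D latent tokens could be queried from a new view without rendering RGB.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Feature = List[float]


@dataclass
class Camera:
    """Hypothetical pinhole camera that looks down +z and only translates."""
    center: Vec3
    focal: float
    width: int
    height: int

    def unproject(self, u: int, v: int, depth: float) -> Vec3:
        # Lift the centre of latent pixel (u, v) to a world-space point.
        x = (u + 0.5 - self.width / 2) * depth / self.focal
        y = (v + 0.5 - self.height / 2) * depth / self.focal
        return (x + self.center[0], y + self.center[1], depth + self.center[2])

    def project(self, p: Vec3) -> Optional[Tuple[int, int, float]]:
        x, y, z = (p[i] - self.center[i] for i in range(3))
        if z <= 0:
            return None  # behind the camera
        u = math.floor(self.focal * x / z + self.width / 2)
        v = math.floor(self.focal * y / z + self.height / 2)
        if 0 <= u < self.width and 0 <= v < self.height:
            return u, v, z
        return None  # outside the image bounds


class LatentSpatialMemory:
    """Toy cache: one latent feature per occupied voxel (an assumption)."""

    def __init__(self, voxel_size: float = 0.25) -> None:
        self.voxel_size = voxel_size
        self.tokens: Dict[Tuple[int, int, int], Tuple[Vec3, Feature]] = {}

    def write(self, cam: Camera, latents, depths) -> None:
        # Used both to initialize the cache and to write content back later;
        # a token landing in an occupied voxel overwrites the old one.
        for v, row in enumerate(latents):
            for u, feat in enumerate(row):
                p = cam.unproject(u, v, depths[v][u])
                key = tuple(math.floor(c / self.voxel_size) for c in p)
                self.tokens[key] = (p, list(feat))

    def read(self, cam: Camera):
        # Splat tokens into the target view; the nearest token per pixel wins.
        # Pixels that no token reaches stay None (content still to generate).
        grid = [[None] * cam.width for _ in range(cam.height)]
        zbuf = [[math.inf] * cam.width for _ in range(cam.height)]
        for p, feat in self.tokens.values():
            hit = cam.project(p)
            if hit is not None and hit[2] < zbuf[hit[1]][hit[0]]:
                u, v, z = hit
                zbuf[v][u], grid[v][u] = z, feat
        return grid


cam0 = Camera(center=(0.0, 0.0, 0.0), focal=2.0, width=2, height=2)
cam1 = Camera(center=(1.0, 0.0, 0.0), focal=2.0, width=2, height=2)
flat_depth = [[2.0, 2.0], [2.0, 2.0]]  # made-up depth for the demo

memory = LatentSpatialMemory()
first = [[[0.1], [0.2]], [[0.3], [0.4]]]
memory.write(cam0, first, flat_depth)       # initialize from the first frame
view1 = memory.read(cam1)                   # read for a shifted target view
generated = [[[0.2], [0.9]], [[0.4], [0.8]]]
memory.write(cam1, generated, flat_depth)   # write the new chunk back
```

In this toy setup, the shifted camera sees only part of the cached content, so `view1` contains `None` entries where the generator would have to synthesize new latents. Writing the generated chunk back fills those regions for later views.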

Efficiency

Mirage reduces repeated 3D cache rendering while preserving strong world modeling quality.

Efficiency comparison showing Mirage's generation speed, cache memory reduction, and WorldScore performance.

Qualitative Evaluation

Each row compares the same trajectory and conditioning across Mirage and four baselines.

BibTeX

@article{wang2026mirage,
  title   = {Latent Spatial Memory for Video World Models},
  author  = {Wang, Weijie and Zhao, Haoyu and Yang, Yifan and Chen, Feng and Zhang, Zeyu and He, Yefei and Duan, Zicheng and Chen, Donny Y. and Yang, Yuqing and Zhuang, Bohan},
  journal = {arXiv preprint arXiv:2606.09828},
  year    = {2026}
}