HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

CVPR 2026

Huizhi Liang1, 2, *, ‡ Yichao Shen2, 3, *, ‡ Yu Deng2 Sicheng Xu2 Zhiyuan Feng1, 2, ‡ Tong Zhang4 Yaobo Liang2 Jiaolong Yang2

1 Tsinghua University

2 Microsoft Research Asia

3 IAIR, Xi'an Jiaotong University

4 University of the Chinese Academy of Sciences

* Equal contribution

‡ Work done during internship at MSRA

[Teaser figure]

Our HiSpatial-3B model develops hierarchical 3D spatial intelligence from geometric perception to abstract reasoning, and achieves state-of-the-art results on multiple spatial benchmarks.

Abstract

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

Large-Scale Hierarchical Spatial Dataset

Our automated pipeline constructs a dataset with 5M images, 45M objects, and over 2B VQA pairs. Explore the samples from different cognitive levels below.

Pixel-Wise 3D Point Querying

Q: what is the xyz of the pixel in <loc0325><loc0951><loc0325><loc0951>

A: 0.61m -0.20m 1.17m
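The <locXXXX> tokens above look like PaliGemma-style location tokens, i.e. image coordinates quantized into 1024 bins, with the (y, x) pair repeated to form a degenerate box; this encoding is an assumption on our part, not something the page states. A minimal sketch of such a round trip:

```python
# Sketch of a PaliGemma-style <locXXXX> coordinate encoding.
# ASSUMPTION: 1024 bins over normalized image coordinates, y-token first,
# with the (y, x) pair repeated, as the sample above suggests.

def pixel_to_loc_tokens(x, y, width, height, bins=1024):
    """Encode a pixel (x, y) as a repeated pair of <locXXXX> tokens."""
    bx = min(int(x / width * bins), bins - 1)
    by = min(int(y / height * bins), bins - 1)
    return f"<loc{by:04d}><loc{bx:04d}>" * 2

def loc_tokens_to_pixel(tokens, width, height, bins=1024):
    """Decode the first (y, x) token pair back to a bin-center pixel."""
    ids = [int(s) for s in tokens.replace("<loc", " ").replace(">", "").split()]
    by, bx = ids[0], ids[1]
    return (bx + 0.5) / bins * width, (by + 0.5) / bins * height
```

Under this reading, <loc0325><loc0951> in the sample would denote the pixel near row 325, column 951 of the 1024-bin grid.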

Pairwise Depth Ordering

Q: between the pixel in <loc0725><loc0338><loc0725><loc0338> and the pixel in <loc0352><loc0562><loc0352><loc0562>, which one has a smaller depth value?

A: the pixel in <loc0725><loc0338><loc0725><loc0338> is closer

Object Height Estimation
Object Size Estimation

Q: How tall would you say the faucet appears to be?

A: We estimate the height of the faucet to be 0.11 meters.

Object Width Estimation
Object Size Estimation

Q: How far does the door stretch from side to side?

A: We estimate the width of the door to be 0.77 meters.

Distance to Camera
Object Distance to Camera

Q: How far is the woman from the camera?

A: 3.85m

Object Grounding

Q: detect the aabb of the woman in white tennis attire holding a racket and standing near the net on the left side of the court

A: -2.57m -1.12m 13.22m -2.06m 0.22m 13.55m
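The answer reads as two opposite corners of a metric axis-aligned bounding box: (x_min, y_min, z_min) followed by (x_max, y_max, z_max). Assuming an object mask and a per-pixel metric point map are available, such an AABB can be sketched in a few lines (an illustrative helper, not the paper's pipeline code):

```python
import numpy as np

def masked_aabb(point_map, mask):
    """Min/max corners of an object's 3D axis-aligned bounding box.

    point_map: (H, W, 3) metric camera-space XYZ per pixel.
    mask:      (H, W) boolean object mask.
    """
    pts = point_map[mask]                 # (N, 3) object points
    return pts.min(axis=0), pts.max(axis=0)

# Toy example: a 2x2 "object" patch inside a 4x4 point map.
pm = np.zeros((4, 4, 3))
pm[1:3, 1:3] = [[[0.1, 0.2, 1.0], [0.3, 0.2, 1.1]],
                [[0.1, 0.4, 1.0], [0.3, 0.4, 1.2]]]
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
lo, hi = masked_aabb(pm, m)               # the two answer corners, in meters
```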

Object Orientation

Q: Which side of the sports car is facing the camera?
A. back B. left C. front D. right

A: D

Object to Object Distance
Distance

Q: What is the distance between the golf club and the man in a light blue polo shirt?

A: 0.72 meters.

Object to Object Direction
Relative Direction

Q: provide the direction vector pointing from the person (highlighted by the red box) to the sign.

A: -0.46 0.18 -0.87
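The answer's components have unit norm, which suggests a normalized offset between the two objects' 3D centers. A minimal sketch under that assumption (the center coordinates below are made up):

```python
import numpy as np

def direction_vector(src_center, dst_center):
    """Unit vector pointing from src to dst in camera coordinates."""
    d = np.asarray(dst_center, dtype=float) - np.asarray(src_center, dtype=float)
    return d / np.linalg.norm(d)

# Hypothetical 3D centers (meters): person at z = 2.0 m, sign nearer the camera.
v = direction_vector([0.5, 0.1, 2.0], [-0.5, 0.5, 0.1])
```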

Object Orientation Comparison
Comparison (orientation)

Q: Would you say the person (highlighted by a red box) and the person (highlighted by a blue box) are oriented similarly?
(A) Yes
(B) No

A: (A)

Object Existence
Relative Position

Q: Is there any trash bin to the left of the person (highlighted by the red box)?

A: No.

Front/Back Position
Relative Position

Q: Identify the object that is farther from the camera: the glasses or the watch?
I. the watch II. the glasses

A: II

Relative Position

Q: Which relative direction best describes the desktop computer from the box?

A: The desktop computer can be found below the box.

Relative Position
Comparison (volume)

Q: Identify the largest person.

A: the glasses-wearing woman in striped shirt gently holding a sleeping newborn wrapped in a patterned blanket.

Left/Right Position
Relative Position

Q: Between the chair and the cello, which one is positioned on the right side?
1. the cello 2. the chair

A: 2

Spatial Description
Comparison (depth)

Q: Which ice cream is the nearest to the observer?

A: the pink ice cream with visible red fruit swirls located in the rightmost glass bowl on the wooden tray.

Perspective Taking

Q: From the woman's viewpoint, on which side is the plaque most prominently positioned?
(A). above
(B). right
(C). below
(D). left

A: (B)

Object Counting

Q: How many chairs are farther from the observer than the desk?

A: 4.

Perspective Taking

Q: From the man in a dark blue suit's perspective, is the man in a black polo shirt located left or right of the man in a dark blue suit?
(A) right (B) left

A: (B)

Problem Solving

Q: Is there enough space between the chair and the sink for a person to walk through comfortably? Why? Please provide a detailed explanation with a chain of thought.

A:

Step 1: Plan - To decide whether a person can walk between the chair and the sink, we need the widths of both objects and their center-to-center distance, from which we can compute the minimum edge-to-edge separation.

Step 2: Facts - object_size_width: chair is 0.66m wide. object_size_width: sink is 0.16m wide. object_center_distance: chair and sink are 2.11m apart.

Step 3: Reasoning with facts - Using object_center_distance (2.11m) and object_size_width (chair 0.66m, sink 0.16m), the edge-to-edge separation is 2.11m - (0.66m/2 + 0.16m/2) = 2.11m - (0.33m + 0.08m) = 1.70m, which exceeds a typical comfortable walking width (~1 meter). Therefore, yes, there is enough space.

Step 4: Final answer - Yes, there is enough space between the chair and the sink for a person to walk through comfortably; their edges are about 1.7 meters apart and their centers more than 2 meters apart.
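The arithmetic in Steps 2 and 3 reduces to one subtraction; a small sketch (the helper name and the ~1 m comfort threshold are illustrative, not from the paper):

```python
def clearance_between(width_a, width_b, center_distance):
    """Edge-to-edge gap between two objects, from widths and center distance."""
    return center_distance - (width_a / 2 + width_b / 2)

gap = clearance_between(0.66, 0.16, 2.11)   # chair width, sink width, centers
fits = gap >= 1.0                           # ~1 m comfortable walking width
# gap is about 1.70 m, so fits is True
```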

Problem Solving

Q: Will an object that is 0.5 meters wide fit inside the basket without sticking out? Explain your reasoning step by step

A:
Step 1: Plan - We need the width of the basket to compare against the 0.5-meter object.

Step 2: Facts - object_size_width: basket is 0.70m wide.

Step 3: Reasoning - A 0.5-meter-wide object is narrower than the basket's 0.70m width.

Step 4: Final answer - Yes, it will fit completely inside without sticking out.

Model Structure

We propose an RGB-D vision-language model that takes metric-scale monocular point maps as input, which can be derived from either depth estimators or ground-truth depth.
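As a rough illustration of the point-map input, a metric depth map can be unprojected into camera-space XYZ through pinhole intrinsics; this is the standard construction, and the intrinsics below are made up:

```python
import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject a metric depth map (H, W) into a camera-space point map (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel grids, shape (H, W)
    x = (u - cx) / fx * depth                        # pinhole: X = (u - cx) * Z / fx
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Flat wall 2 m away, under assumed intrinsics.
pm = depth_to_point_map(np.full((480, 640), 2.0),
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# The pixel at the principal point lies on the optical axis: XYZ = (0, 0, 2).
```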

[Model structure figure]

Model Inference Visualization

Explore how our model interacts with users and answers diverse 3D spatial queries based on the visual input.

[Inference visualization: user query, Spatial VLM answer, and ground truth]

Model Performance

Quantitative VQA Benchmarks (Level 1 & 2)

SpatialRGPT-Quantitative: Width and Height are Level-1 tasks; Direct, Horizontal, and Vertical Distance are Level-2 tasks; Avg. covers all five. QSpatial-Bench reports Plus, ScanNet, and their average. Higher is better (↑). HiSpatial-3B* uses ground-truth depth.

| Model | Input | Width | Height | Direct Dist. | Horiz. Dist. | Vert. Dist. | Avg. | QSpatial Plus | QSpatial ScanNet | QSpatial Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | |
| GPT-4o | RGB | 51.10 | 68.40 | 29.70 | 25.40 | 33.00 | 41.50 | 61.06 | 69.41 | 66.30 |
| GPT-5 | RGB | 59.09 | 67.66 | 29.72 | 25.21 | 13.72 | 40.47 | 63.37 | 73.53 | 68.45 |
| Gemini-2.5-Pro | RGB | 37.20 | 43.51 | 16.78 | 17.39 | 15.53 | 26.57 | 40.59 | 55.46 | 49.92 |
| Open-Source Vision-Language Models | | | | | | | | | | |
| PaliGemma2-3B | RGB | 20.3 | 27.07 | 20.27 | 15.57 | 22.64 | 21.18 | 36.63 | 30.06 | 32.84 |
| Qwen3VL-8B | RGB | 27.06 | 40.60 | 31.10 | 31.96 | 18.86 | 30.37 | 35.64 | 55.88 | 48.34 |
| InternVL3.5-8B | RGB | 27.81 | 41.35 | 31.08 | 31.96 | 17.92 | 30.53 | 59.41 | 57.06 | 57.94 |
| Spatial Specialist Models | | | | | | | | | | |
| RoboRefer-8B-SFT | RGB-D | 17.29 | 60.15 | 49.32 | 37.70 | 28.30 | 39.25 | 20.79 | 31.07 | 27.24 |
| SpatialRGPT-8B | RGB-D | 48.90 | 61.70 | 45.90 | 68.00 | 56.60 | 56.22 | - | - | - |
| MM-Spatial-3B | RGB-D | 55.60 | 83.50 | 59.50 | 82.00 | 63.20 | 68.70 | - | - | - |
| Our Model Variants | | | | | | | | | | |
| HiSpatial-3B-RGB | RGB | 69.92 | 84.96 | 66.89 | 71.34 | 68.87 | 72.43 | 76.24 | 75.88 | 76.01 |
| HiSpatial-3B | RGB-XYZ | 69.17 | 84.96 | 79.73 | 86.89 | 75.47 | 79.28 | 88.12 | 84.17 | 85.16 |
| HiSpatial-3B* | RGB-XYZ (GT) | 70.68 | 84.21 | 83.11 | 90.16 | 79.25 | 81.46 | - | 84.12 | - |

Qualitative VQA Benchmarks (Level 1-3)

EmbSpatial, RoboSpatial, CV-Bench-3D, and CV-Bench-Relation cover Level 2; 3DSRBench spans Levels 1-3.

| Method | Input | EmbSpatial ↑ | RoboSpatial ↑ | CV-Bench-3D ↑ | CV-Bench-Relation ↑ | 3DSRBench ↑ |
|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | |
| GPT-4o | RGB | 63.38 | 77.20 | 84.90 | 84.62 | 44.20 |
| Gemini-2.5-Pro | RGB | 76.67 | 77.24 | 90.80 | 93.54 | 48.47 |
| Claude-3.7-Sonnet | RGB | 33.33 | 60.73 | 85.00 | 74.15 | 48.20 |
| Open-Source Vision-Language Models | | | | | | |
| PaliGemma2-3B | RGB | 28.32 | 71.54 | 42.17 | 72.00 | 8.15 |
| InternVL3.5-8B | RGB | 69.81 | 78.86 | 85.09 | 88.00 | 51.58 |
| InternVL3.5-14B | RGB | 71.62 | 78.86 | 86.75 | 88.15 | 57.32 |
| Qwen-3-VL-8B | RGB | 78.50 | 82.11 | 90.66 | 92.92 | 52.80 |
| Qwen-3-VL-30B-A3B | RGB | 76.40 | 83.73 | 92.00 | 95.38 | 55.70 |
| Spatial Specialist Models | | | | | | |
| SpatialBot-3B | RGB-D | 50.66 | 72.36 | 69.08 | 69.38 | 54.67 |
| SpaceLLaVA-13B | RGB | 49.40 | 61.00 | 68.50 | 63.69 | 46.55 |
| SpatialRGPT-8B | RGB-D | 59.62 | 66.67 | 89.15 | 91.00 | 48.40 |
| RoboRefer-8B-SFT | RGB-D | 72.53 | 84.55 | 95.92 | 96.90 | 51.24 |
| Our Model Variants | | | | | | |
| HiSpatial-3B-RGB | RGB | 79.78 | 83.74 | 95.58 | 95.08 | 64.34 |
| HiSpatial-3B | RGB-XYZ | 80.71 | 86.18 | 97.58 | 95.69 | 63.81 |

Level Dependency Ablation

Inter-level task dependency analysis reveals that removing lower-level tasks during training consistently reduces higher-level performance.

Auxiliary 3D Input Ablation

Effect of integrating different auxiliary 3D representations. Our proposed absolute XYZ metric space yields significant improvements over standard RGB or relative depth.

RGB Base
RGB + Rel Depth (MoGe2)
RGB + XYZ (Ours, MoGe2)
RGB + XYZ (Ours, GT)

BibTeX

@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}