HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Our HiSpatial-3B model develops hierarchical 3D spatial intelligence from geometric perception to abstract reasoning, and achieves state-of-the-art results on multiple spatial benchmarks.

Abstract

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.

Large-Scale Hierarchical Spatial Dataset

Our automated pipeline constructs a dataset with 5M images, 45M objects, and over 2B VQA pairs. Explore the samples from different cognitive levels below.

Pixel-Wise 3D Point Querying

Q: what is the xyz of the pixel in <loc0325><loc0951><loc0325><loc0951>

A: 0.61m -0.20m 1.17m

Pairwise Depth Ordering

Q: between the pixel in <loc0725><loc0338><loc0725><loc0338> and the pixel in <loc0352><loc0562><loc0352><loc0562>, which one has a smaller depth value?

A: the pixel in <loc0725><loc0338><loc0725><loc0338> is closer

Object Size Estimation

Q: How tall would you say the faucet appears to be?

A: We estimate the height of the faucet to be 0.11 meters.

Object Size Estimation

Q: How far does the door stretch from side to side?

A: We estimate the width of the door to be 0.77 meters.

Object Distance to Camera

Q: How far is the woman from the camera?

A: 3.85m

Object Grounding

Q: detect the aabb of the woman in white tennis attire holding a racket and standing near the net on the left side of the court

A: -2.57m -1.12m 13.22m -2.06m 0.22m 13.55m

Object Orientation

Q: Which side of the sports car is facing the camera?
A. back B. left C. front D. right

A: D

Distance

Q: What is the distance between the golf club and the man in a light blue polo shirt?

A: 0.72 meters.

Relative Direction

Q: provide the direction vector pointing from the person(highlighted by the red box) to the sign.

A: -0.46 0.18 -0.87

Comparison (orientation)

Q: Would you say the person (highlighted by a red box) and the person (highlighted by a blue box) are oriented similarly?
(A). yes
(B). No

A: (A)

Relative Position

Q: Is there any trash bin can to the left of the person (highlighted by the red box)?

A: No.

Relative Position

Q: Identify the object that is farther from the camera: the glasses or the watch?
I. the watch II. the glasses

A: II

Relative Position

Q: Which relative direction best describes the desktop computer from the box?

A: The desktop computer can be found down of the box.

Comparison(volume)

Q: Identify the largest person.

A: the glasses-wearing woman in striped shirt gently holding a sleeping newborn wrapped in a patterned blanket.

Relative Position

Q: Between the chair and the cello, which one is positioned on the right side?
1. the cello 2. the chair

A: 2

Comparison(depth)

Q: Which ice cream is the nearest to the observer?

A: the pink ice cream with visible red fruit swirls located in the rightmost glass bowl on the wooden tray.

Perspective Taking

Q: From the woman's viewpoint, which side is the plaque positioned most prominently?
(A). above
(B). right
(C). below
(D). Left

A: (B)

Object Counting

Q: How many chair are farther from the observer than the desk?

A: 4.

Perspective Taking

Q: From the man in a dark blue suit's perspective, is the man in a black polo shirt located left or right of the man in a dark blue suit?
(A) right (B). left

A: (B)

Problem Solving

Q: Is there enough space between the chair and the sink for a person to walk through comfortably? Why? Please provide a detailed explanation with cot

A:

Step1: Plan - To answer if there's enough space for a person to walk between the chair and sink, we need the widths of both objects and their center-to-center distance. We'll calculate the minimum edge-to-edge separation using these facts.

Step2: Facts - object_size_width: chair is 0.66m wide. object_size_width: sink is 0.16m wide. object_center_distance: chair and sink are 2.11m apart.

Step3: Reasoning with facts - Using object_center_distance (2.11m), object_size_width (chair is 0.66m, sink is 0.16m), edge-to-edge separation is approximately: 2.11m - (0.66m/2 + 0.16m/2) = 2.11m - (0.33m + 0.08m) = about 1.7m, which exceeds typical comfortable walking width (~1 meter). Therefore, yes, there is enough space.

Step4: Final answer - Yes, there is enough space between the chair and the sink for a person to walk through comfortably; their edges are separated by at least about 1 meter and their centers by over 2 meters.

Problem Solving

Q: Will an object that is 0.5 meters wide fit inside the basket without sticking out? Explain your reasoning step by step

A:
Step1: Plan - We need to know the width of the basket.

Step2: Facts - object_size_width: basket is 0.70m wide.

Step3: Reasoning - A 0.5 meter wide object is less than 0.70m.

Step4: Final answer - Yes, it will fit completely inside without sticking out.

Model Structure

We propose an RGB-D vision-language model that takes metric-scale monocular point maps as input, which can be derived from either depth estimators or ground-truth depth.

Model Inference Visualization

Explore how our model interacts with users and answers diverse 3D spatial queries based on the visual input.

User

Spatial VLM

Ground Truth

Model Performance

Quantitative VQA Benchmarks (Level 1 & 2)

Model	Input	SpatialRGPT-Quantitative ↑						QSpatial-Bench ↑
		Level 1		Level 2			Avg.	Level 1&2
		Width	Height	Direct Distance	Horizontal Distance	Vertical Distance	Avg.	Plus	ScanNet	Avg.
Proprietary Models
GPT-4o	RGB	51.10	68.40	29.70	25.40	33.00	41.50	61.06	69.41	66.30
GPT-5	RGB	59.09	67.66	29.72	25.21	13.72	40.47	63.37	73.53	68.45
Gemini-2.5-Pro	RGB	37.20	43.51	16.78	17.39	15.53	26.57	40.59	55.46	49.92
Open-Source Vision-Language Models
PaliGemma2-3B	RGB	20.3	27.07	20.27	15.57	22.64	21.18	36.63	30.06	32.84
Qwen3VL-8B	RGB	27.06	40.60	31.10	31.96	18.86	30.37	35.64	55.88	48.34
InternVL3.5-8B	RGB	27.81	41.35	31.08	31.96	17.92	30.53	59.41	57.06	57.94
Spatial Specialist Models
RoboRefer-8B-SFT	RGB-D	17.29	60.15	49.32	37.70	28.30	39.25	20.79	31.07	27.24
SpatialRGPT-8B	RGB-D	48.90	61.70	45.90	68.00	56.60	56.22	-	-	-
MM-Spatial-3B	RGB-D	55.60	83.50	59.50	82.00	63.20	68.70	-	-	-
Our Model Variants
HiSpatial-3B-RGB	RGB	69.92	84.96	66.89	71.34	68.87	72.43	76.24	75.88	76.01
HiSpatial-3B	RGB-XYZ	69.17	84.96	79.73	86.89	75.47	79.28	88.12	84.17	85.16
HiSpatial-3B*	RGB-XYZ (GT)	70.68	84.21	83.11	90.16	79.25	81.46	-	84.12	-

Qualitative VQA Benchmarks (Level 1-3)

Method	Input	EmbSpatial ↑	RoboSpatial ↑	CV-Bench-3D ↑	CV-Bench-Relation ↑	3DSRBench ↑
Method	Input	Level 2				Level 1-3
Proprietary Models
GPT-4o	RGB	63.38	77.20	84.90	84.62	44.20
Gemini-2.5-Pro	RGB	76.67	77.24	90.80	93.54	48.47
Claude-3.7-Sonnet	RGB	33.33	60.73	85.00	74.15	48.20
Open-Source Vision-Language Models
PaliGemma2-3B	RGB	28.32	71.54	42.17	72.00	8.15
InternVL3.5-8B	RGB	69.81	78.86	85.09	88.00	51.58
InternVL3.5-14B	RGB	71.62	78.86	86.75	88.15	57.32
Qwen-3-VL-8B	RGB	78.50	82.11	90.66	92.92	52.80
Qwen-3-VL-30B-A3B	RGB	76.40	83.73	92.00	95.38	55.70
Spatial Specialist Models
SpatialBot-3B	RGB-D	50.66	72.36	69.08	69.38	54.67
SpaceLLaVA-13B	RGB	49.40	61.00	68.50	63.69	46.55
SpatialRGPT-8B	RGB-D	59.62	66.67	89.15	91.00	48.40
RoboRefer-8B-SFT	RGB-D	72.53	84.55	95.92	96.90	51.24
Our Model Variants
HiSpatial-3B-RGB	RGB	79.78	83.74	95.58	95.08	64.34
HiSpatial-3B	RGB-XYZ	80.71	86.18	97.58	95.69	63.81

Level Dependency Ablation

Inter-level task dependency analysis reveals that removing lower-level tasks during training consistently reduces higher-level performance.

Auxiliary 3D Input Ablation

Effect of integrating different auxiliary 3D representations. Our proposed absolute XYZ metric space yields significant improvements over standard RGB or relative depth.

RGB Base

RGB + Rel Depth (MoGe2)

RGB + XYZ (Ours, MoGe2)

RGB + XYZ (Ours, GT)

BibTeX

@inproceedings{liang2026hispatial,
  title={HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models},
  author={Liang, Huizhi and Shen, Yichao and Deng, Yu and Xu, Sicheng and Feng, Zhiyuan and Zhang, Tong and Liang, Yaobo and Yang, Jiaolong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}