VITRA:
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Qixiu Li*†,1,2, Yu Deng*,2, Yaobo Liang*,2, Lin Luo*,2,
Lei Zhou†,2, Chengtang Yao2, Lingqi Zeng†,2, Zhiyuan Feng†,1,2, Huizhi Liang†,1,2, Sicheng Xu2, Yizhong Zhang2, Xi Chen2, Hao Chen2, Lily Sun2, Dong Chen2, Jiaolong Yang‡,2, Baining Guo2
*Equal contribution   †Interns at Microsoft Research   ‡Project lead. Email: {jiaoyan}@microsoft.com
1Tsinghua University  2Microsoft Research Asia 
Overview of the VITRA framework and training data pipeline

VITRA is a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models on a large corpus of unscripted, real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous-hand VLA model architecture and pretrain it on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Furthermore, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.

Transforming Human Hand Video to VLA Data

Existing robotic manipulation V-L-A data typically comprise simple, short-horizon tasks (e.g., "pick up the sponge on the table", "wipe the stove with a cloth"), which can be composed into long-horizon tasks by a high-level planner. Each data episode comprises a language instruction, a video frame sequence, and frame-aligned 3D action chunks of the end-effector in the robot or camera coordinate system. Our approach analyzes an unscripted human video and generates V-L-A data in the same format, treating the two human hands as end-effectors.
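For concreteness, the snippet below sketches what one such episode could look like as a data structure. It is a minimal illustration only; the field names and array shapes are our assumptions, not the released data format.

# Minimal sketch of a hand V-L-A episode matching the description above.
# Field names and shapes are assumptions, not the released data format.
from dataclasses import dataclass
import numpy as np

@dataclass
class HandVLAEpisode:
    instruction: str           # e.g. "Right hand: pick up the sponge on the table."
    frames: np.ndarray         # (T, H, W, 3) egocentric RGB frames
    hand_pose: np.ndarray      # (T, D) framewise 3D hand state (wrist pose + finger articulation)
    camera_pose: np.ndarray    # (T, 4, 4) per-frame camera-to-world extrinsics
    action_chunks: np.ndarray  # (T, K, D) future action chunk aligned to each frame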


Visualization of our hand V-L-A data

The left part in each case visualizes the reconstructed 3D hands for individual frames. The middle part shows the hand action trajectory from the current frame to the end of the episode, with a color gradient from purple to yellow indicating the temporal progression throughout the episode. The rightmost part depicts the hand's 3D trajectory projected into the camera coordinate system at the first frame.


Left hand: Pick up the wooden sticks.
Right hand: Pick up and gather the wooden pieces.

Diversity Analysis

To investigate data diversity, we conduct a detailed analysis of the visual observations and language instructions in our hand V-L-A dataset. We compare the word frequency of language instructions across different datasets (ours, EgoDex, AgiBot World beta, DROID, and OXE), as well as the diversity of DINO image features.



(a)-(c): Language instruction statistics across different VLA datasets. (d): t-SNE visualization of image features.
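As a rough illustration of how such diversity statistics can be obtained, the sketch below counts instruction word frequencies and embeds images with a DINO-family encoder before a t-SNE projection. The encoder choice (facebook/dinov2-base) and preprocessing details are assumptions, not necessarily the exact setup behind the figure.

# Sketch of the two diversity measurements; model and preprocessing are assumptions.
from collections import Counter
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import AutoImageProcessor, AutoModel

def instruction_word_frequency(instructions):
    """Count word occurrences across a dataset's language instructions."""
    counts = Counter()
    for text in instructions:
        counts.update(text.lower().split())
    return counts

def image_feature_tsne(image_paths, device="cuda"):
    """Embed images with a DINO-family encoder and project them to 2D with t-SNE."""
    processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
    model = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()
    feats = []
    with torch.no_grad():
        for path in image_paths:
            inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
            # Use the pooled embedding as a global image descriptor.
            feats.append(model(**inputs).pooler_output.squeeze(0).cpu())
    return TSNE(n_components=2).fit_transform(torch.stack(feats).numpy())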

Scaling Behavior

We investigate how the scale of pretraining data influences human hand action prediction performance and real-robot task performance. We compare the model trained on the full pretraining dataset with models trained on sub-sampled datasets at different ratios.



The circle size indicates the visual diversity of the pretraining data. (a) Data scaling behavior on the grasping task of zero-shot human hand action prediction in unseen real-life environments. (b) Task success rate on real-robot pick-and-place tasks with seen objects and backgrounds. (c) Task success rate on real-robot pick-and-place tasks with unseen objects and backgrounds.
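A minimal sketch of the sub-sampling protocol is given below; the ratios in the usage line are placeholders for exposition, not the ones used in the figure.

# Assumed protocol: build pretraining subsets by uniformly sub-sampling episodes
# at fixed ratios with a fixed seed, so subsets are comparable across runs.
import random

def subsample_episodes(episode_ids, ratio, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(episode_ids) * ratio))
    return rng.sample(episode_ids, k)

# Example (placeholder ratios): {r: subsample_episodes(all_ids, r) for r in (0.1, 0.25, 0.5, 1.0)}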

Zero-shot Human Hand Action Prediction in Unseen Environments

We evaluate our pretrained VLA model on human hand action prediction in completely unseen environments. Several visualization examples are presented, in which the model performs inference from single egocentric images captured in real-life environments not included in the training data. In each video, the hand executes only one complete action chunk. Note that a single action chunk does not necessarily complete the entire task, and executing a full chunk at once may lead to a loss of precision; the demonstrations here are provided solely for visualization purposes.
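To make this trade-off concrete, the sketch below shows the standard closed-loop alternative to executing a full chunk open-loop: re-predicting after executing only a prefix of each chunk. The model and environment interfaces here are hypothetical, not our released API.

# Hypothetical closed-loop rollout: execute only the first few actions of each
# predicted chunk, then re-predict from the latest observation.
def rollout(model, env, instruction, horizon=8, steps=200):
    obs = env.reset()                                   # assumed interface: returns an observation
    for _ in range(steps // horizon):
        chunk = model.predict_chunk(obs, instruction)   # (K, D) future actions, K >= horizon (assumed)
        for action in chunk[:horizon]:                  # execute a prefix only, then replan
            obs = env.step(action)                      # assumed interface: returns the next observation
    return obs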




Real-World Robot Dexterous Manipulation

After pretraining, the model can be fine-tuned on a small amount of robot data for deployment. We treat the human hand action space as a superset of the robot hand's action space and align the robot's action space with the human hand's.
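As an illustration of this alignment, the sketch below embeds a robot hand command into a human-hand-style action vector in which the robot's degrees of freedom occupy a subset of the human hand's. The dimensions and index map are assumptions for exposition, not necessarily the exact retargeting used in our system.

# Illustrative alignment: express a robot command in a human-hand-style action vector.
import numpy as np

HUMAN_DOF = 6 + 45          # assumed: wrist pose (6) + finger articulation (45)
ROBOT_FINGER_DOF = 12       # assumed: a 12-DoF dexterous robot hand
# Hypothetical index map from robot finger joints to their human-hand counterparts.
ROBOT_TO_HUMAN_IDX = np.arange(6, 6 + ROBOT_FINGER_DOF)

def robot_to_human_action(wrist_pose_6d, robot_finger_angles, rest_pose=None):
    """Embed a robot action into the (superset) human-hand action space."""
    action = np.zeros(HUMAN_DOF) if rest_pose is None else rest_pose.copy()
    action[:6] = wrist_pose_6d                     # shared wrist pose
    action[ROBOT_TO_HUMAN_IDX] = robot_finger_angles  # robot DoFs fill their human counterparts
    return action                                  # remaining human DoFs stay at the rest pose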




General Pick & Place

Examples of the robot executing tasks involving General Pick & Place, driven by our model.

Functional Grasping

Examples of the robot executing tasks involving Functional Grasping, driven by our model.

Sequential Tasks

Examples of the robot executing tasks following multiple instructions in a row, driven by our model.



Quantitative Experiments

Comparison with prior art and ablations

We evaluate the performance of our VLA model fine-tuned on a small set of real robot trajectories for dexterous manipulation tasks. We compare our approach against: (1) VPP, (2) π0, (3) a model without our V-L-A data pretraining, (4) a model pretrained on our dataset using LAPA, and (5) a model pretrained on OXE.


Success rates on seen real-world robot dexterous manipulation tasks (in %).

Success rates on unseen real-world robot dexterous manipulation tasks (in %).

BibTeX

@article{li2025vitra,
  title={Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
  author={Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
  journal={arXiv preprint arXiv:2510.21571},
  year={2025}
}