VITRA:
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Qixiu Li*†,1,2, Yu Deng*,2, Yaobo Liang*,2, Lin Luo*,2,
Lei Zhou†,2, Chengtang Yao2, Lingqi Zeng†,2, Zhiyuan Feng†,1,2, Huizhi Liang†,1,2, Sicheng Xu2, Yizhong Zhang2, Xi Chen2, Hao Chen2, Lily Sun2, Dong Chen2, Jiaolong Yang‡,2, Baining Guo2
*Equal contribution   †Interns at Microsoft Research   ‡Project lead. Email: {jiaoyan}@microsoft.com
1Tsinghua University  2Microsoft Research Asia 
Overview of the VITRA framework and training data pipeline

VITRA is a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models on a large corpus of unscripted, real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous-hand VLA model architecture and pretrain it on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Furthermore, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.

Transforming Human Hand Video to VLA Data

Existing robotic manipulation V-L-A data typically comprise simple, short-horizon tasks (e.g., "pick up the sponge on the table", "wipe the stove with a cloth"), which can be composed into long-horizon tasks by a high-level planner. Each data episode comprises a language instruction, a video frame sequence, and frame-aligned 3D action chunks of the end-effector in the robot or camera coordinate system. Our approach analyzes an unscripted human video and generates V-L-A data in the same format, treating the two human hands as end-effectors.
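For concreteness, the snippet below sketches what one such episode could look like as a data structure. It is a minimal illustration only; the field names and array shapes are our assumptions, not the released data format.

# Minimal sketch of a hand V-L-A episode matching the description above.
# Field names and shapes are assumptions, not the released data format.
from dataclasses import dataclass
import numpy as np

@dataclass
class HandVLAEpisode:
    instruction: str           # e.g. "Right hand: pick up the sponge on the table."
    frames: np.ndarray         # (T, H, W, 3) egocentric RGB frames
    hand_pose: np.ndarray      # (T, D) framewise 3D hand state (wrist pose + finger articulation)
    camera_pose: np.ndarray    # (T, 4, 4) per-frame camera-to-world extrinsics
    action_chunks: np.ndarray  # (T, K, D) future action chunk aligned to each frame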


Visualization of our hand V-L-A data

The left part in each case visualizes the reconstructed 3D hands for individual frames. The middle part shows the hand action trajectory from the current frame to the end of the episode, with a color gradient from purple to yellow indicating the temporal progression throughout the episode. The rightmost part depicts the hand's 3D trajectory projected into the camera coordinate system at the first frame.


Left hand: Pick up the wooden sticks.
Right hand: Pick up and gather the wooden pieces.

Diversity Analysis

To investigate data diversity, we conduct a detailed analysis of the visual observations and language instructions in our hand V-L-A dataset. We compare the word frequency of language instructions across different datasets (ours, EgoDex, AgiBot World beta, DROID, and OXE), as well as the diversity of DINO image features.



(a)-(c): Language instruction statistics across different VLA datasets. (d): t-SNE visualization of image features.
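As a rough illustration of how such diversity statistics can be obtained, the sketch below counts instruction word frequencies and embeds images with a DINO-family encoder before a t-SNE projection. The encoder choice (facebook/dinov2-base) and preprocessing details are assumptions, not necessarily the exact setup behind the figure.

# Sketch of the two diversity measurements; model and preprocessing are assumptions.
from collections import Counter
import torch
from PIL import Image
from sklearn.manifold import TSNE
from transformers import AutoImageProcessor, AutoModel

def instruction_word_frequency(instructions):
    """Count word occurrences across a dataset's language instructions."""
    counts = Counter()
    for text in instructions:
        counts.update(text.lower().split())
    return counts

def image_feature_tsne(image_paths, device="cuda"):
    """Embed images with a DINO-family encoder and project them to 2D with t-SNE."""
    processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
    model = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()
    feats = []
    with torch.no_grad():
        for path in image_paths:
            inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
            # Use the pooled embedding as a global image descriptor.
            feats.append(model(**inputs).pooler_output.squeeze(0).cpu())
    return TSNE(n_components=2).fit_transform(torch.stack(feats).numpy())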

Scaling Behavior

We investigate how the scale of pretraining data influences human hand action prediction performance and real-robot task performance. We compare the model trained on the full pretraining dataset with models trained on sub-sampled datasets at different ratios.



The circle size indicates the visual diversity of the pretraining data. (a) Data scaling behavior on the grasping task of zero-shot human hand action prediction in unseen real-life environments. (b) Task success rate on real-robot pick-and-place tasks with seen objects and backgrounds. (c) Task success rate on real-robot pick-and-place tasks with unseen objects and backgrounds.
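A minimal sketch of the sub-sampling protocol is given below; the ratios in the usage line are placeholders for exposition, not the ones used in the figure.

# Assumed protocol: build pretraining subsets by uniformly sub-sampling episodes
# at fixed ratios with a fixed seed, so subsets are comparable across runs.
import random

def subsample_episodes(episode_ids, ratio, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(episode_ids) * ratio))
    return rng.sample(episode_ids, k)

# Example (placeholder ratios): {r: subsample_episodes(all_ids, r) for r in (0.1, 0.25, 0.5, 1.0)}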

Zero-shot Human Hand Action Prediction in Unseen Environments

We evaluate our pretrained VLA model on human hand action prediction in completely unseen environments. Several visualization examples are presented, in which the model performs inference from single egocentric images captured in real-life environments not included in the training data. In each video, the hand executes only one complete action chunk. Note that a single action chunk does not necessarily complete the entire task, and executing a full chunk at once may lead to a loss of precision; the demonstrations here are provided solely for visualization purposes.
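To make this trade-off concrete, the sketch below shows the standard closed-loop alternative to executing a full chunk open-loop: re-predicting after executing only a prefix of each chunk. The model and environment interfaces here are hypothetical, not our released API.

# Hypothetical closed-loop rollout: execute only the first few actions of each
# predicted chunk, then re-predict from the latest observation.
def rollout(model, env, instruction, horizon=8, steps=200):
    obs = env.reset()                                   # assumed interface: returns an observation
    for _ in range(steps // horizon):
        chunk = model.predict_chunk(obs, instruction)   # (K, D) future actions, K >= horizon (assumed)
        for action in chunk[:horizon]:                  # execute a prefix only, then replan
            obs = env.step(action)                      # assumed interface: returns the next observation
    return obs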




Real-World Robot Dexterous Manipulation

After pretraining, the model can be fine-tuned on a small amount of robot data for deployment. We treat the human hand action space as a superset of the robot hand's action space and align the robot's action space with the human hand's.
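As an illustration of this alignment, the sketch below embeds a robot hand command into a human-hand-style action vector in which the robot's degrees of freedom occupy a subset of the human hand's. The dimensions and index map are assumptions for exposition, not necessarily the exact retargeting used in our system.

# Illustrative alignment: express a robot command in a human-hand-style action vector.
import numpy as np

HUMAN_DOF = 6 + 45          # assumed: wrist pose (6) + finger articulation (45)
ROBOT_FINGER_DOF = 12       # assumed: a 12-DoF dexterous robot hand
# Hypothetical index map from robot finger joints to their human-hand counterparts.
ROBOT_TO_HUMAN_IDX = np.arange(6, 6 + ROBOT_FINGER_DOF)

def robot_to_human_action(wrist_pose_6d, robot_finger_angles, rest_pose=None):
    """Embed a robot action into the (superset) human-hand action space."""
    action = np.zeros(HUMAN_DOF) if rest_pose is None else rest_pose.copy()
    action[:6] = wrist_pose_6d                     # shared wrist pose
    action[ROBOT_TO_HUMAN_IDX] = robot_finger_angles  # robot DoFs fill their human counterparts
    return action                                  # remaining human DoFs stay at the rest pose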




General Pick & Place

Examples of the robot executing tasks involving General Pick & Place, driven by our model.

Functional Grasping

Examples of the robot executing tasks involving Functional Grasping, driven by our model.

Sequential Tasks

Examples of the robot executing tasks following multiple instructions in a row, driven by our model.



Quantitative Experiments

Comparison with prior art and ablations

We evaluate the performance of our VLA model fine-tuned on a small set of real robot trajectories for dexterous manipulation tasks. We compare our approach against: (1) VPP, (2) π0, (3) a model without our V-L-A data pretraining, (4) a model pretrained on our dataset using LAPA, and (5) a model pretrained on OXE.


Success rates on seen real-world robot dexterous manipulation tasks (in %).

Success rates on unseen real-world robot dexterous manipulation tasks (in %).

BibTeX

@article{li2025vitra,
  title={Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
  author={Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
  journal={arXiv preprint arXiv:2510.21571},
  year={2025}
}