Existing robotic manipulation V-L-A data typically comprise simple, short-horizon tasks (e.g., "pick up the sponge on the table", "wipe the stove with the cloth"), which can be composed into long-horizon tasks by a high-level planner. Each data episode comprises a language instruction, a video frame sequence, and frame-aligned 3D action chunks of the end-effector in the robot or camera coordinate system. Our approach analyzes an unscripted human video and generates V-L-A data in this format, treating the two human hands as end-effectors.
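The sketch below illustrates what one episode in this format might look like; the field names, array shapes, and the per-hand action dimensionality are placeholders for illustration, not the dataset's actual schema.

```python
# Minimal sketch of one V-L-A episode in this format (field names and shapes
# are hypothetical, not the dataset's actual schema).
from dataclasses import dataclass
import numpy as np

@dataclass
class Episode:
    instruction: str    # language instruction, e.g. "pick up the sponge on the table"
    frames: np.ndarray  # video frames, shape (T, H, W, 3)
    actions: np.ndarray # frame-aligned 3D action chunks for both hands,
                        # shape (T, 2, D): per-frame, per-hand end-effector
                        # state in the camera or robot coordinate system

episode = Episode(
    instruction="wipe the stove with the cloth",
    frames=np.zeros((16, 224, 224, 3), dtype=np.uint8),
    actions=np.zeros((16, 2, 9), dtype=np.float32),  # D=9 is an arbitrary placeholder
)
```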
To investigate data diversity, we conduct a detailed analysis of the visual observations and language instructions in our hand V-L-A dataset. We compare the word frequency of language instructions across different datasets (ours, EgoDex, AgiBot World Beta, DROID, and OXE), as well as the diversity of DINO image features.
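A minimal sketch of these two measures, assuming instructions are plain strings and DINO image features have already been extracted into an (N, D) array; the function names and the pairwise-distance formulation are our assumptions, not necessarily the paper's exact procedure.

```python
# Sketch: word-frequency statistics over instructions and a simple visual
# diversity proxy over precomputed DINO image features.
from collections import Counter
import numpy as np

def word_frequency(instructions):
    """Count word occurrences across all language instructions."""
    counter = Counter()
    for text in instructions:
        counter.update(text.lower().split())
    return counter

def mean_pairwise_cosine_distance(features, num_pairs=10000, seed=0):
    """Approximate visual diversity as the mean cosine distance between
    randomly sampled pairs of DINO image features."""
    rng = np.random.default_rng(seed)
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    i = rng.integers(0, len(feats), num_pairs)
    j = rng.integers(0, len(feats), num_pairs)
    return float(np.mean(1.0 - np.sum(feats[i] * feats[j], axis=1)))

freqs = word_frequency(["pick up the sponge on the table", "wipe the stove with the cloth"])
diversity = mean_pairwise_cosine_distance(np.random.randn(100, 768))
```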
We investigate how the scale of pretraining data influences human hand action prediction performance and performance on the real robot. We compare the model trained on the full pretraining dataset with models trained on datasets sub-sampled at different ratios.
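A minimal sketch of how such sub-sampled training sets could be constructed; the sampling ratios shown are illustrative, not necessarily those used in our experiments.

```python
# Sketch: uniformly sub-sample the pretraining episodes for data-scaling runs.
import random

def subsample_episodes(episodes, ratio, seed=0):
    """Uniformly sample a fraction of episodes for one data-scaling run."""
    rng = random.Random(seed)
    k = max(1, int(len(episodes) * ratio))
    return rng.sample(episodes, k)

all_episodes = list(range(100000))  # placeholder for the full pretraining set
subsets = {r: subsample_episodes(all_episodes, r) for r in (0.1, 0.25, 0.5, 1.0)}
```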
We evaluate the performance of our pretrained VLA model on human hand action prediction in completely unseen environments. Several visualization examples are presented, in which the model infers from a single egocentric image captured in real-life environments not included in the training data. In each video, the hand executes only one complete action chunk. Note that a single action chunk does not necessarily complete the entire task, and executing a full chunk at once may lead to a loss of precision; the demonstrations here are provided solely for visualization purposes.
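The sketch below shows this visualization setup: one action chunk is predicted from a single egocentric image and an instruction, then executed open-loop in full. `predict_action_chunk` is a placeholder standing in for the pretrained VLA's inference call, whose exact API is not specified here.

```python
# Sketch: open-loop execution of a single predicted action chunk.
import numpy as np

def predict_action_chunk(image, instruction, chunk_len=16, action_dim=9):
    """Placeholder for the pretrained VLA model's inference (hypothetical API)."""
    return np.zeros((chunk_len, 2, action_dim), dtype=np.float32)  # (steps, hands, dims)

def visualize_one_chunk(image, instruction, render_step):
    chunk = predict_action_chunk(image, instruction)
    # Open-loop: every step of the single predicted chunk is executed without
    # re-observing, which is why precision can degrade over the chunk.
    for hand_actions in chunk:
        render_step(hand_actions)

image = np.zeros((224, 224, 3), dtype=np.uint8)
visualize_one_chunk(image, "pick up the sponge on the table", render_step=lambda a: None)
```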
After pretraining, the model can be fine-tuned on small-scale robot data for deployment. We treat the human hand action space as a superset of the robot hand's action space and align the robot's action space with the human hand's.
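A hedged sketch of this alignment idea: the robot's action dimensions are embedded at fixed indices of the human hand action vector, so robot actions can be read in and out of the superset space. The index map and dimensionalities below are assumptions for illustration only, not the paper's actual mapping.

```python
# Sketch: embed robot actions into an assumed human-hand action superset.
import numpy as np

HAND_ACTION_DIM = 48                 # assumed size of the human hand action vector
ROBOT_TO_HAND_INDEX = np.arange(12)  # assumed: robot dims map to the first 12 hand dims

def robot_to_hand_action(robot_action, fill_value=0.0):
    """Embed a robot action into the (superset) human hand action space."""
    hand_action = np.full(HAND_ACTION_DIM, fill_value, dtype=np.float32)
    hand_action[ROBOT_TO_HAND_INDEX] = robot_action
    return hand_action

def hand_to_robot_action(hand_action):
    """Read the robot-controllable dimensions back out of a hand action."""
    return hand_action[ROBOT_TO_HAND_INDEX]

robot_action = np.zeros(12, dtype=np.float32)
assert np.allclose(hand_to_robot_action(robot_to_hand_action(robot_action)), robot_action)
```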
We evaluate the performance of our VLA model fine-tuned on a small set of real robot trajectories for dexterous manipulation tasks. We compare our approach against: (1) VPP, (2) π0, (3) a model without our V-L-A data pretraining, (4) a model pretrained on our dataset using LAPA, and (5) a model pretrained on OXE.
Success rates on seen real-world robot dexterous manipulation tasks (in %).
Success rates on unseen real-world robot dexterous manipulation tasks (in %).
@article{li2025vitra,
title={Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos},
author={Qixiu Li and Yu Deng and Yaobo Liang and Lin Luo and Lei Zhou and Chengtang Yao and Lingqi Zeng and Zhiyuan Feng and Huizhi Liang and Sicheng Xu and Yizhong Zhang and Xi Chen and Hao Chen and Lily Sun and Dong Chen and Jiaolong Yang and Baining Guo},
journal={arXiv preprint arXiv:2510.21571},
year={2025}
}