
At a high level, villa-X consists of two main components: the Latent Action Model (LAM) and the ACT module.

The LAM consists of an Inverse Dynamics Model (IDM), a visual Forward Dynamics Model (FDM), and a proprioceptive Forward Dynamics Model (proprio FDM). The IDM takes a pair of frames \(o_t\) and \(o_{t+K}\), taken \(K\) steps apart, and outputs a latent action \(z_t\). The latent action \(z_t\) and the first frame \(o_t\) are then passed into the FDM to reconstruct the future frame \(\hat{o}_{t+K}\). In parallel, the latent action \(z_t\) and the low-level robot state \(q_t\) are passed into the proprio FDM to predict the future robot states \(\hat{q}_{t+1}, \cdots, \hat{q}_{t+K}\) and robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+K-1}\).
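To make this data flow concrete, here is a minimal sketch of the LAM interfaces in PyTorch. This is not the released implementation: the plain MLP blocks, the flattened frame representation, and all dimensions (`OBS_DIM`, `STATE_DIM`, `ACT_DIM`, `Z_DIM`, `K`) are illustrative assumptions.

```python
# Minimal sketch of the LAM data flow; module choices and sizes are assumptions.
import torch
import torch.nn as nn

K, OBS_DIM, STATE_DIM, ACT_DIM, Z_DIM = 8, 512, 16, 7, 32  # assumed sizes

class LatentActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # IDM: infers a latent action z_t from the frame pair (o_t, o_{t+K}).
        self.idm = nn.Sequential(nn.Linear(2 * OBS_DIM, 256), nn.ReLU(), nn.Linear(256, Z_DIM))
        # Visual FDM: reconstructs o_{t+K} from o_t and z_t.
        self.fdm = nn.Sequential(nn.Linear(OBS_DIM + Z_DIM, 256), nn.ReLU(), nn.Linear(256, OBS_DIM))
        # Proprio FDM: predicts K future robot states and K robot actions from q_t and z_t.
        self.proprio_fdm = nn.Sequential(
            nn.Linear(STATE_DIM + Z_DIM, 256), nn.ReLU(),
            nn.Linear(256, K * (STATE_DIM + ACT_DIM)),
        )

    def forward(self, o_t, o_tK, q_t):
        z_t = self.idm(torch.cat([o_t, o_tK], dim=-1))             # latent action z_t
        o_hat_tK = self.fdm(torch.cat([o_t, z_t], dim=-1))         # reconstructed frame o_{t+K}
        out = self.proprio_fdm(torch.cat([q_t, z_t], dim=-1))
        q_hat = out[..., : K * STATE_DIM].reshape(-1, K, STATE_DIM)  # q_{t+1}, ..., q_{t+K}
        a_hat = out[..., K * STATE_DIM:].reshape(-1, K, ACT_DIM)     # a_t, ..., a_{t+K-1}
        return z_t, o_hat_tK, q_hat, a_hat
```

With batched inputs, `LatentActionModel()(o_t, o_tK, q_t)` returns the latent action together with the visual and proprioceptive predictions described above.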
The ACT module is built upon a pre-trained Vision-Language Model (VLM). The VLM first processes the given frame \(o_t\) and the language instruction \(l\) to produce latent features. Conditioned on these features (through attention), the ACT-latent uses a diffusion-based model to predict a sequence of \(n\) latent actions \(\hat{z}_t, \cdots, \hat{z}_{t+(n-1)K}\). Concurrently, the ACT-robot predicts a sequence of \(m\) robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+m-1}\), conditioned on the internal features of both the VLM and the ACT-latent.
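The sketch below, under the same caveats, illustrates how the two ACT heads are conditioned. The stand-in `vlm` encoder, the ten-step update loop (a toy substitute for a proper diffusion sampler), and all sizes are assumptions.

```python
# Illustrative sketch of the ACT data flow; not the released architecture.
import torch
import torch.nn as nn

N_LATENT, M_ROBOT, Z_DIM, ACT_DIM, FEAT_DIM = 4, 16, 32, 7, 256  # assumed sizes

class ACT(nn.Module):
    def __init__(self):
        super().__init__()
        self.vlm = nn.Linear(512 + 64, FEAT_DIM)  # stand-in for the pre-trained VLM encoder
        # ACT-latent: iteratively denoises a sequence of n latent actions, conditioned on VLM features.
        self.act_latent = nn.Sequential(
            nn.Linear(N_LATENT * Z_DIM + FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, N_LATENT * Z_DIM))
        # ACT-robot: predicts m robot actions from the VLM and ACT-latent features.
        self.act_robot = nn.Sequential(
            nn.Linear(FEAT_DIM + N_LATENT * Z_DIM, 256), nn.ReLU(), nn.Linear(256, M_ROBOT * ACT_DIM))

    def forward(self, o_t, lang):
        feats = self.vlm(torch.cat([o_t, lang], dim=-1))       # frame + instruction features
        z_seq = torch.randn(o_t.shape[0], N_LATENT * Z_DIM)    # start the latent sequence from noise
        for _ in range(10):                                    # toy stand-in for diffusion sampling
            z_seq = z_seq - 0.1 * self.act_latent(torch.cat([z_seq, feats], dim=-1))
        a_seq = self.act_robot(torch.cat([feats, z_seq], dim=-1))
        return z_seq.view(-1, N_LATENT, Z_DIM), a_seq.view(-1, M_ROBOT, ACT_DIM)
```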
Both LAM and ACT are trained on a large collection of robotic data and human egocentric videos, including Open X-Embodiment, Something-Something V2, Ego4D, and others.
Given an image \(o_t\) and a language instruction \(l\), we use the trained ACT-latent to plan a sequence of latent actions \(\hat{z}_t, \cdots, \hat{z}_{t+(n-1)K}\). To visualize the results, we train a separate world model that renders the future frames resulting from a sequence of latent actions. Below, we show the predicted outcomes of applying the latent action sequences generated by ACT-latent for different language instructions, all conditioned on the same input image.
Click on different instructions to view the corresponding outcomes. In-distribution refers to samples from the validation set of our training data, while Out-of-distribution refers to samples from the Realman robot arm, a new embodiment never seen during training. The results show that ACT-latent accurately identifies the target objects and generates latent actions that follow the language instructions to solve the tasks. ACT-latent also recognizes the concepts in the emoji instructions, which rarely appear in robot datasets, suggesting that villa-X retains the general vision-language capabilities of the initial VLM after pre-training.
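A hedged sketch of this visualization procedure: plan latent actions for several instructions from the same image, then roll out the separately trained world model to render the predicted frames. `act_latent_plan` and `world_model_step` are hypothetical wrappers around the trained modules, not released APIs.

```python
import torch

def render_outcomes(o_0, instructions, act_latent_plan, world_model_step, n=4):
    """Render predicted futures for several instructions from one input image."""
    videos = {}
    for lang in instructions:
        z_seq = act_latent_plan(o_0, lang)        # n latent actions planned by ACT-latent
        frames, o = [o_0], o_0
        for i in range(n):
            o = world_model_step(o, z_seq[i])     # render the frame after applying latent action i
            frames.append(o)
        videos[lang] = torch.stack(frames)        # one predicted video per instruction
    return videos
```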
Our learned latent actions demonstrate consistent transferability. To illustrate this, we first
use the trained LAM to extract latent actions
from a source video. These actions are then applied to a different initial image using the world
model to render the resulting future frames. Below, we present several examples showcasing the
application of the same sequence of latent actions (left video) across different contexts (right video).
Additionally, we use the trained proprio FDM to predict future robot actions from the current robot state \(q_t\) and the extracted latent actions. The predicted actions are then rendered within the simulator.
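The transfer experiment can be summarized by the sketch below, reusing the hypothetical `LatentActionModel` and `world_model_step` from the earlier sketches; the assumption that `source_video` is already subsampled so that consecutive frames are \(K\) steps apart is ours.

```python
import torch

def transfer_latents(source_video, new_image, q_0, lam, world_model_step):
    """Replay latent actions from a source video on a new image and decode them to robot motion."""
    # Extract a latent action from each consecutive frame pair (assumed K steps apart).
    latents = [lam.idm(torch.cat([source_video[i], source_video[i + 1]], dim=-1))
               for i in range(len(source_video) - 1)]
    # Apply the same latent actions to a different initial image via the world model.
    frames, o = [new_image], new_image
    for z in latents:
        o = world_model_step(o, z)
        frames.append(o)
    # Decode the first latent into future robot states and actions with the proprio FDM.
    proprio_pred = lam.proprio_fdm(torch.cat([q_0, latents[0]], dim=-1))
    return frames, proprio_pred
```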
Finally, following common VLA evaluation protocols, we evaluate our method on manipulation tasks in both simulated and real-world environments. Below are several demonstrations showcasing its capabilities.
Place apple in closed top drawer
Put the black bowl in the bottom drawer of the cabinet and close it
Put eggplant in basket
Pick horizontal coke can
Put both the alphabet soup and the cream cheese box in the basket
Put spoon on table cloth
Close drawer
Turn on the stove and put the moka pot on it
Put carrot on plate
Push the green block to position 1
Push the green block to position 4
Put the green block from blue bowl onto table
Put the green block on table into blue bowl
Stack the wooden block onto the green block
Unstack the wooden block from the green block
Pour orange juice into the cup
Pick the onion into the basket
Straighten the cup
Pick the apple into the blue bowl
Flick the ball
Pick the yellow toy into the basket
Stack the blue cube on the red cube
Pick the mango into the green plate
Flick the ball
Method | Pick (Google Robot) | Move (Google Robot) | Drawer (Google Robot) | Place (Google Robot) | Avg. (Google Robot) | Carrot (WidowX) | Eggplant (WidowX) | Spoon (WidowX) | Cube (WidowX) | Avg. (WidowX)
---|---|---|---|---|---|---|---|---|---|---
RT-1-X* | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 | 4.2 | 0.0 | 0.0 | 0.0 | 1.1
Octo-base* | 17.0 | 4.2 | 22.7 | 0.0 | 11.0 | 8.3 | 43.1 | 12.5 | 0.0 | 16.0
OpenVLA* | 16.3 | 46.2 | 35.6 | 0.0 | 24.5 | 0.0 | 4.1 | 0.0 | 0.0 | 1.0
RoboVLMs* | 72.7 | 66.3 | 26.8 | 36.1 | 50.5 | 25.0 | 0.0 | 20.8 | 8.3 | 13.5
RoboVLM | 77.3 | 61.7 | 43.5 | 24.1 | 51.7 | 20.8 | 79.2 | 45.8 | 4.2 | 37.5
GR00T | 0.7 | 1.9 | 2.9 | 0.0 | 1.4 | 0.0 | 13.9 | 1.4 | 0.0 | 3.8
MoTo | 74.0 | 60.4 | 43.1 | N/A | N/A | N/A | N/A | N/A | N/A | N/A
LAPA | N/A | N/A | N/A | N/A | N/A | 45.8 | 58.3 | 70.8 | 54.2 | 57.3
Ours w/o latent | 56.3 | 25.8 | 27.3 | 13.9 | 30.8 | 31.3 | 74.6 | 61.7 | 28.3 | 49.0
Ours | 98.7 | 75.0 | 59.3 | 5.6 | 59.6 | 46.3 | 64.6 | 77.9 | 61.3 | 62.5
Method | Spatial | Object | Goal | Long | Average |
---|---|---|---|---|---|
Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
Octo-base | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
Ours w/o latent | 86.0 | 86.5 | 85.0 | 70.0 | 81.9 |
Ours | 97.5 | 97.0 | 91.5 | 74.5 | 90.1 |
Method | Pick in | Pick out | Push | Stack | Unstack | Change block color | Change table cover |
---|---|---|---|---|---|---|---|
GR00T | 30 | 70 | 10 | 10 | 60 | 50 | 30 |
Ours w/o latent | 40 | 80 | 30 | 60 | 70 | 40 | 30 |
Ours | 30 | 100 | 50 | 50 | 100 | 60 | 60 |
Method | Pick & Place (seen) | Pick & Place (unseen) | Stack Cube (seen) | Stack Cube (unseen) | Place Cup Upright (seen) | Place Cup Upright (unseen) | Pour Water (seen) | Pour Water (unseen) | Flick Ball (seen) | Flick Ball (unseen)
---|---|---|---|---|---|---|---|---|---|---
GR1 | 56 | 40 | 15 | 5 | 0 | 0 | 0 | 0 | 40 | 10
GR00T | 44 | 28 | 20 | 0 | 20 | 0 | 0 | 0 | 30 | 0
Ours w/o latent | 72 | 60 | 70 | 40 | 40 | 30 | 40 | 10 | 50 | 30
Ours | 84 | 68 | 75 | 50 | 60 | 30 | 60 | 30 | 50 | 40
@article{chen2025villa0x0,
title = {villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models},
author = {Xiaoyu Chen and Hangxing Wei and Pushi Zhang and Chuheng Zhang and Kaixin Wang and Yanjiang Guo and Rushuai Yang and Yucen Wang and Xinquan Xiao and Li Zhao and Jianyu Chen and Jiang Bian},
year = {2025},
  journal = {arXiv preprint arXiv:2507.23682}
}