villa-x

Enhancing Latent Action Modeling in VLAs

*Equal contribution · Interns at Microsoft Research · §Project lead
1Microsoft Research 2Tsinghua University 3Wuhan University 4Hong Kong University of Science and Technology 5Nanjing University

TL;DR: villa-X bridges high-level planning and low-level execution via latent actions.

villa-X: a Visual-Language-Latent-Action Model

At a high level, our villa-X consists of two main components:

  • a LAM (Latent Action Model) module that infers latent actions from a pair of observations;
  • an ACT (Actor) module that jointly models latent actions and robot actions given an initial visual observation and a textual task instruction.

Figure: villa-X model architecture.

The LAM consists of an Inverse Dynamics Model (IDM), a visual Forward Dynamics Model (FDM), and a proprio Forward Dynamics Model (proprio FDM). The IDM takes a pair of frames, \(o_t\) and \(o_{t+K}\), and outputs a latent action \(z_t\). The latent action \(z_t\) and the first frame \(o_t\) are then passed into the FDM to predict a reconstructed frame \(\hat{o}_{t+K}\). At the same time, the latent action \(z_t\) and the low-level robot state \(q_t\) are passed into the proprio FDM to predict future robot states \(\hat{q}_{t+1}, \cdots, \hat{q}_{t+K}\) and robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+K-1}\).
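To make the data flow concrete, below is a minimal PyTorch-style sketch of the LAM. It is not the released implementation: observations are assumed to be pre-encoded feature vectors, and all module sizes, the MLP architectures, and the horizon K are illustrative assumptions.

    # Minimal PyTorch sketch of the LAM (not the released implementation).
    # Observations are assumed to be pre-encoded feature vectors; all sizes,
    # MLP architectures, and the horizon K are illustrative assumptions.
    import torch
    import torch.nn as nn

    class LatentActionModel(nn.Module):
        def __init__(self, obs_dim=512, state_dim=16, act_dim=7, latent_dim=32, horizon_k=5):
            super().__init__()
            self.horizon_k, self.state_dim, self.act_dim = horizon_k, state_dim, act_dim
            # IDM: infer the latent action z_t from the frame pair (o_t, o_{t+K}).
            self.idm = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.GELU(),
                                     nn.Linear(256, latent_dim))
            # Visual FDM: reconstruct (features of) o_{t+K} from (o_t, z_t).
            self.fdm = nn.Sequential(nn.Linear(obs_dim + latent_dim, 256), nn.GELU(),
                                     nn.Linear(256, obs_dim))
            # Proprio FDM: predict q_{t+1..t+K} and a_{t..t+K-1} from (q_t, z_t).
            self.proprio_fdm = nn.Sequential(
                nn.Linear(state_dim + latent_dim, 256), nn.GELU(),
                nn.Linear(256, horizon_k * (state_dim + act_dim)))

        def forward(self, obs_t, obs_tk, q_t):
            z_t = self.idm(torch.cat([obs_t, obs_tk], dim=-1))        # latent action
            obs_tk_hat = self.fdm(torch.cat([obs_t, z_t], dim=-1))    # visual prediction
            prop = self.proprio_fdm(torch.cat([q_t, z_t], dim=-1))
            prop = prop.view(-1, self.horizon_k, self.state_dim + self.act_dim)
            q_hat = prop[..., :self.state_dim]                        # future robot states
            a_hat = prop[..., self.state_dim:]                        # robot action chunk
            return z_t, obs_tk_hat, q_hat, a_hat

    def lam_loss(model, obs_t, obs_tk, q_t, q_future, a_future):
        """Sketch of the training signal: visual reconstruction + proprio regression."""
        _, obs_hat, q_hat, a_hat = model(obs_t, obs_tk, q_t)
        return (nn.functional.mse_loss(obs_hat, obs_tk)
                + nn.functional.mse_loss(q_hat, q_future)
                + nn.functional.mse_loss(a_hat, a_future))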

The ACT module is built upon a pre-trained Vision-Language Model (VLM). The VLM first processes the given frame \(o_t\) and the language instruction \(l\) to produce latent features. Then, conditioned on these latent features (through attention), ACT-latent adopts a diffusion-based model to predict a sequence of \(n\) latent actions \(\hat{z}_t, \cdots, \hat{z}_{t+(n-1)K}\). Concurrently, ACT-robot predicts a sequence of \(m\) robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+m-1}\), conditioned on the internal features of both the VLM and ACT-latent.
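The sketch below shows one way the two ACT heads could be wired to the VLM features. It is an assumption-laden simplification: the VLM backbone, the diffusion noise schedule and sampling loop, and all dimensions are placeholders; only the conditioning pattern (ACT-latent attends to VLM features, ACT-robot consumes both VLM and ACT-latent features) follows the description above.

    # Illustrative sketch of the two ACT heads; the VLM backbone, the diffusion
    # noise schedule/sampling loop, and all dimensions are assumptions.
    import torch
    import torch.nn as nn

    class ActLatentHead(nn.Module):
        """Diffusion-style head: predicts the noise added to a sequence of n latent
        actions, conditioned on VLM token features through cross-attention."""
        def __init__(self, latent_dim=32, cond_dim=768):
            super().__init__()
            self.in_proj = nn.Linear(latent_dim, cond_dim)
            self.cross_attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)
            self.out_proj = nn.Linear(cond_dim, latent_dim)

        def forward(self, noisy_latents, vlm_feats):
            # noisy_latents: (B, n, latent_dim); vlm_feats: (B, T, cond_dim)
            h = self.in_proj(noisy_latents)
            h, _ = self.cross_attn(h, vlm_feats, vlm_feats)   # attend to VLM features
            return self.out_proj(h), h                        # (noise prediction, internal features)

    class ActRobotHead(nn.Module):
        """Predicts a chunk of m robot actions from the VLM features together with
        the internal features of ACT-latent."""
        def __init__(self, cond_dim=768, act_dim=7, m_actions=8):
            super().__init__()
            self.m_actions, self.act_dim = m_actions, act_dim
            self.mlp = nn.Sequential(nn.Linear(2 * cond_dim, 512), nn.GELU(),
                                     nn.Linear(512, m_actions * act_dim))

        def forward(self, vlm_feats, latent_feats):
            pooled = torch.cat([vlm_feats.mean(dim=1), latent_feats.mean(dim=1)], dim=-1)
            return self.mlp(pooled).view(-1, self.m_actions, self.act_dim)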

Both LAM and ACT are trained on a large collection of robotic data and human egocentric videos, including Open X-Embodiment, Something-Something V2, Ego4D, and others.

Motion planning with latent actions

Given an image and a language instruction, we use the trained ACT-latent to plan a sequence of latent actions. To visualize the results, we train a separate world model to render the future frames resulting from a sequence of latent actions. Below, we show the predicted outcomes of applying the latent action sequences generated by ACT-latent from different language instructions, all conditioned on the same input image.
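A minimal sketch of this planning-and-visualization loop is given below. The `plan_latents` and `render_frames` callables stand in for the trained ACT-latent sampler and the separately trained world model; their interfaces are assumptions for illustration.

    # Sketch of the planning-and-visualization loop. `plan_latents` (the trained
    # ACT-latent sampler) and `render_frames` (the separately trained world model)
    # are assumed interfaces, not part of the released code.
    import torch

    @torch.no_grad()
    def visualize_plan(plan_latents, render_frames, image, instruction, n_latents=4):
        """Plan n latent actions for (image, instruction), then roll them out visually."""
        latents = plan_latents(image, instruction, n_latents)   # (n, latent_dim)
        frames = [image]
        for z in latents:                                        # render keyframes one by one
            frames.append(render_frames(frames[-1], z))
        return frames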

Click on different instructions to view the corresponding outcomes. In-distribution refers to samples from the validation set of our training data, while Out-of-distribution refers to samples from the Realman robot arm, a new embodiment never seen during training. The results demonstrate that ACT-latent accurately identifies the target objects and generates latent actions that follow the language instructions to solve the tasks. ACT-latent also correctly interprets the concepts in the emoji, which rarely appear in robot datasets, suggesting that villa-X retains the general vision-language capabilities of the initial VLM after pre-training.

In-distribution

Open top drawer
Open middle drawer
Open bottom drawer

Move orange near rxbar blueberry
Move orange near the apple

Pick up the brush
Put the tomato in the pot
Slide the purple towel

Move the silver pot to the left burner
Pick the green fruit into the pot
Pick up the spoon

Place the red object in the silver pot
Place the red object into the basket
Place the red object into the sink

Put the blue fork near the banana
Put the blue fork on the blue cloth

Out-of-distribution

Touch the corn
Touch the bowl
Touch the leaf

Pick up the spoon
Pick up the cup

Pick up the blue block
Pick up the green block

Learning consistent latent actions

Our learned latent actions demonstrate consistent transferability. To illustrate this, we first use the trained LAM to extract latent actions from a source video. These actions are then applied to a different initial image using the world model to render the resulting future frames. Below, we present several examples showcasing the application of the same sequence of latent actions (left video) across different contexts (right video).
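The transfer procedure can be summarized by the sketch below, where `idm_encode` (the LAM's IDM) and `render_frames` (the world model) are assumed interfaces and the frame stride K is a placeholder.

    # Sketch of the latent-action transfer experiment. `idm_encode` (the LAM's IDM)
    # and `render_frames` (the world model) are assumed interfaces; the frame stride
    # K is a placeholder.
    import torch

    @torch.no_grad()
    def transfer_latent_actions(idm_encode, render_frames, source_frames, target_image, stride_k=5):
        # Extract one latent action per (o_t, o_{t+K}) pair of the source video.
        latents = [idm_encode(source_frames[i], source_frames[i + stride_k])
                   for i in range(0, len(source_frames) - stride_k, stride_k)]
        # Replay the same latent actions starting from a different initial image.
        rollout = [target_image]
        for z in latents:
            rollout.append(render_frames(rollout[-1], z))
        return rollout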

Additionally, we use the trained proprio FDM to predict future robot actions based on the current robot state and the extracted latent actions. The predictions are then rendered within the simulator.
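A corresponding sketch for the proprio-FDM pathway is shown below; `proprio_fdm` and `sim_step` are assumed interfaces (the former returning predicted states and a K-step action chunk, the latter advancing the simulator by one control step).

    # Sketch of decoding latent actions through the proprio FDM and executing them in
    # a simulator. `proprio_fdm` and `sim_step` are assumed interfaces for illustration.
    import torch

    @torch.no_grad()
    def execute_latents_in_sim(proprio_fdm, sim_step, q_t, latents):
        state = q_t
        for z in latents:
            _, actions = proprio_fdm(state, z)   # (K, act_dim) predicted action chunk
            for a in actions:
                state = sim_step(a)              # simulator returns the next robot state
        return state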

Render with a world model

Render with the proprio FDM and simulator

Finishing manipulation tasks in simulation and the real world

Finally, following common VLA evaluation protocols, we evaluate our method on manipulation tasks in both simulated and real-world environments. Below are several demonstrations showcasing its capabilities.

Simulation

Real World - Realman Robot Arm

Real World - XHAND1

Experimental Results

SIMPLER

Table 1: Performance on SIMPLER. Methods marked with * are evaluated directly after pre-training, whereas other methods are evaluated after post-training on the corresponding dataset.
Method             Google Robot                            WidowX Robot
                   Pick    Move    Drawer  Place   Avg.    Carrot  Eggplant  Spoon   Cube    Avg.
RT-1-X*            56.7    31.7    59.7    21.3    42.4    4.2     0.0       0.0     0.0     1.1
Octo-base*         17.0    4.2     22.7    0.0     11.0    8.3     43.1      12.5    0.0     16.0
OpenVLA*           16.3    46.2    35.6    0.0     24.5    0.0     4.1       0.0     0.0     1.0
RoboVLMs*          72.7    66.3    26.8    36.1    50.5    25.0    0.0       20.8    8.3     13.5
RoboVLM            77.3    61.7    43.5    24.1    51.7    20.8    79.2      45.8    4.2     37.5
GR00T              0.7     1.9     2.9     0.0     1.4     0.0     13.9      1.4     0.0     3.8
MoTo               74.0    60.4    43.1    N/A     N/A     N/A     N/A       N/A     N/A     N/A
LAPA               N/A     N/A     N/A     N/A     N/A     45.8    58.3      70.8    54.2    57.3
Ours w/o latent    56.3    25.8    27.3    13.9    30.8    31.3    74.6      61.7    28.3    49.0
Ours               98.7    75.0    59.3    5.6     59.6    46.3    64.6      77.9    61.3    62.5

LIBERO

Table 2: Performance on 4 LIBERO task suites.
Method             Spatial  Object   Goal     Long     Average
Diffusion Policy   78.3     92.5     68.3     50.5     72.4
Octo-base          78.9     85.7     84.6     51.1     75.1
OpenVLA            84.7     88.4     79.2     53.7     76.5
Ours w/o latent    86.0     86.5     85.0     70.0     81.9
Ours               97.5     97.0     91.5     74.5     90.1

Real robots

Table 3: Performance on Realman robot arm.
Method             Pick in  Pick out  Push  Stack  Unstack  Change block color  Change table cover
GR00T              30       70        10    10     60       50                  30
Ours w/o latent    40       80        30    60     70       40                  30
Ours               30       100       50    50     100      60                  60
Table 4: Performance on Xarm robot arm with XHAND dexterous hand.
Method             Pick & Place      Stack Cube        Place Cup Upright Pour Water        Flick Ball
                   seen     unseen   seen     unseen   seen     unseen   seen     unseen   seen     unseen
GR1                56       40       15       5        0        0        0        0        40       10
GR00T              44       28       20       0        20       0        0        0        30       0
Ours w/o latent    72       60       70       40       40       30       40       10       50       30
Ours               84       68       75       50       60       30       60       30       50       40

BibTeX


      @article{chen2025villa0x0,
        title   = {villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models},
        author  = {Xiaoyu Chen and Hangxing Wei and Pushi Zhang and Chuheng Zhang and Kaixin Wang and Yanjiang Guo and Rushuai Yang and Yucen Wang and Xinquan Xiao and Li Zhao and Jianyu Chen and Jiang Bian},
        year    = {2025},
        journal = {arXiv preprint arXiv:2507.23682}
      }
This website is adapted from NeRFies under a Creative Commons Attribution-ShareAlike 4.0 International License.