
At a high level, villa-X consists of two main components: the Latent Action Model (LAM) and the ACT module.

The LAM consists of an Inverse Dynamics Model (IDM), a visual Forward Dynamics Model (FDM), and a proprioceptive Forward Dynamics Model (proprio FDM). The IDM takes a pair of frames \(o_t\) and \(o_{t+K}\), taken \(K\) steps apart, and outputs a latent action \(z_t\). The latent action \(z_t\) and the first frame \(o_t\) are then passed into the FDM to reconstruct the future frame \(\hat{o}_{t+K}\). In parallel, the latent action \(z_t\) and the low-level robot state \(q_t\) are passed into the proprio FDM to predict the future robot states \(\hat{q}_{t+1}, \cdots, \hat{q}_{t+K}\) and robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+K-1}\).
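To make this data flow concrete, here is a minimal sketch of the LAM interfaces in PyTorch. This is not the released implementation: the plain MLP blocks, the flattened frame representation, and all dimensions (`OBS_DIM`, `STATE_DIM`, `ACT_DIM`, `Z_DIM`, `K`) are illustrative assumptions.

```python
# Minimal sketch of the LAM data flow; module choices and sizes are assumptions.
import torch
import torch.nn as nn

K, OBS_DIM, STATE_DIM, ACT_DIM, Z_DIM = 8, 512, 16, 7, 32  # assumed sizes

class LatentActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # IDM: infers a latent action z_t from the frame pair (o_t, o_{t+K}).
        self.idm = nn.Sequential(nn.Linear(2 * OBS_DIM, 256), nn.ReLU(), nn.Linear(256, Z_DIM))
        # Visual FDM: reconstructs o_{t+K} from o_t and z_t.
        self.fdm = nn.Sequential(nn.Linear(OBS_DIM + Z_DIM, 256), nn.ReLU(), nn.Linear(256, OBS_DIM))
        # Proprio FDM: predicts K future robot states and K robot actions from q_t and z_t.
        self.proprio_fdm = nn.Sequential(
            nn.Linear(STATE_DIM + Z_DIM, 256), nn.ReLU(),
            nn.Linear(256, K * (STATE_DIM + ACT_DIM)),
        )

    def forward(self, o_t, o_tK, q_t):
        z_t = self.idm(torch.cat([o_t, o_tK], dim=-1))             # latent action z_t
        o_hat_tK = self.fdm(torch.cat([o_t, z_t], dim=-1))         # reconstructed frame o_{t+K}
        out = self.proprio_fdm(torch.cat([q_t, z_t], dim=-1))
        q_hat = out[..., : K * STATE_DIM].reshape(-1, K, STATE_DIM)  # q_{t+1}, ..., q_{t+K}
        a_hat = out[..., K * STATE_DIM:].reshape(-1, K, ACT_DIM)     # a_t, ..., a_{t+K-1}
        return z_t, o_hat_tK, q_hat, a_hat
```

With batched inputs, `LatentActionModel()(o_t, o_tK, q_t)` returns the latent action together with the visual and proprioceptive predictions described above.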
The ACT module is built upon a pre-trained Vision-Language Model (VLM). The VLM first processes the given frame \(o_t\) and the language instruction \(l\) to produce latent features. Conditioned on these features (through attention), the ACT-latent uses a diffusion-based model to predict a sequence of \(n\) latent actions \(\hat{z}_t, \cdots, \hat{z}_{t+(n-1)K}\). Concurrently, the ACT-robot predicts a sequence of \(m\) robot actions \(\hat{a}_{t}, \cdots, \hat{a}_{t+m-1}\), conditioned on the internal features of both the VLM and the ACT-latent.
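The sketch below, under the same caveats, illustrates how the two ACT heads are conditioned. The stand-in `vlm` encoder, the ten-step update loop (a toy substitute for a proper diffusion sampler), and all sizes are assumptions.

```python
# Illustrative sketch of the ACT data flow; not the released architecture.
import torch
import torch.nn as nn

N_LATENT, M_ROBOT, Z_DIM, ACT_DIM, FEAT_DIM = 4, 16, 32, 7, 256  # assumed sizes

class ACT(nn.Module):
    def __init__(self):
        super().__init__()
        self.vlm = nn.Linear(512 + 64, FEAT_DIM)  # stand-in for the pre-trained VLM encoder
        # ACT-latent: iteratively denoises a sequence of n latent actions, conditioned on VLM features.
        self.act_latent = nn.Sequential(
            nn.Linear(N_LATENT * Z_DIM + FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, N_LATENT * Z_DIM))
        # ACT-robot: predicts m robot actions from the VLM and ACT-latent features.
        self.act_robot = nn.Sequential(
            nn.Linear(FEAT_DIM + N_LATENT * Z_DIM, 256), nn.ReLU(), nn.Linear(256, M_ROBOT * ACT_DIM))

    def forward(self, o_t, lang):
        feats = self.vlm(torch.cat([o_t, lang], dim=-1))       # frame + instruction features
        z_seq = torch.randn(o_t.shape[0], N_LATENT * Z_DIM)    # start the latent sequence from noise
        for _ in range(10):                                    # toy stand-in for diffusion sampling
            z_seq = z_seq - 0.1 * self.act_latent(torch.cat([z_seq, feats], dim=-1))
        a_seq = self.act_robot(torch.cat([feats, z_seq], dim=-1))
        return z_seq.view(-1, N_LATENT, Z_DIM), a_seq.view(-1, M_ROBOT, ACT_DIM)
```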
Both LAM and ACT are trained on a large collection of robotic data and human egocentric videos, including Open X-Embodiment, Something-Something V2, Ego4D, and others.
Given an image \(o_t\) and a language instruction \(l\), we use the trained ACT-latent to plan a sequence of latent actions \(\hat{z}_t, \cdots, \hat{z}_{t+(n-1)K}\). To visualize the results, we train a separate world model that renders the future frames resulting from a sequence of latent actions. Below, we show the predicted outcomes of applying the latent action sequences generated by ACT-latent for different language instructions, all conditioned on the same input image.
Click on different instructions to view the corresponding outcomes. In-distribution refers to samples from the validation set of our training data, while Out-of-distribution refers to samples from the Realman robot arm, a new embodiment never seen during training. The results show that ACT-latent accurately identifies the target objects and generates latent actions that follow the language instructions to solve the tasks. ACT-latent also recognizes the concepts in the emoji instructions, which rarely appear in robot datasets, suggesting that villa-X retains the general vision-language capabilities of the initial VLM after pre-training.
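A hedged sketch of this visualization procedure: plan latent actions for several instructions from the same image, then roll out the separately trained world model to render the predicted frames. `act_latent_plan` and `world_model_step` are hypothetical wrappers around the trained modules, not released APIs.

```python
import torch

def render_outcomes(o_0, instructions, act_latent_plan, world_model_step, n=4):
    """Render predicted futures for several instructions from one input image."""
    videos = {}
    for lang in instructions:
        z_seq = act_latent_plan(o_0, lang)        # n latent actions planned by ACT-latent
        frames, o = [o_0], o_0
        for i in range(n):
            o = world_model_step(o, z_seq[i])     # render the frame after applying latent action i
            frames.append(o)
        videos[lang] = torch.stack(frames)        # one predicted video per instruction
    return videos
```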
Our learned latent actions demonstrate consistent transferability. To illustrate this, we first
use the trained LAM to extract latent actions
from a source video. These actions are then applied to a different initial image using the world
model to render the resulting future frames. Below, we present several examples showcasing the
application of the same sequence of latent actions (left video) across different contexts (right video).
Additionally, we use the trained proprio FDM to predict future robot actions from the current robot state \(q_t\) and the extracted latent actions. The predicted actions are then rendered within the simulator.
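The transfer experiment can be summarized by the sketch below, reusing the hypothetical `LatentActionModel` and `world_model_step` from the earlier sketches; the assumption that `source_video` is already subsampled so that consecutive frames are \(K\) steps apart is ours.

```python
import torch

def transfer_latents(source_video, new_image, q_0, lam, world_model_step):
    """Replay latent actions from a source video on a new image and decode them to robot motion."""
    # Extract a latent action from each consecutive frame pair (assumed K steps apart).
    latents = [lam.idm(torch.cat([source_video[i], source_video[i + 1]], dim=-1))
               for i in range(len(source_video) - 1)]
    # Apply the same latent actions to a different initial image via the world model.
    frames, o = [new_image], new_image
    for z in latents:
        o = world_model_step(o, z)
        frames.append(o)
    # Decode the first latent into future robot states and actions with the proprio FDM.
    proprio_pred = lam.proprio_fdm(torch.cat([q_0, latents[0]], dim=-1))
    return frames, proprio_pred
```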
Finally, following common VLA evaluation protocols, we evaluate our method on manipulation tasks in both simulated and real-world environments. Below are several demonstrations showcasing its capabilities.
Place apple in closed top drawer
Put the black bowl in the bottom drawer of the cabinet and close it
Put eggplant in basket
Pick horizontal coke can
Put both the alphabet soup and the cream cheese box in the basket
Put spoon on table cloth
Close drawer
Turn on the stove and put the moka pot on it
Put carrot on plate
Push the green block to position 1
Push the green block to position 4
Put the green block from blue bowl onto table
Put the green block on table into blue bowl
Stack the wooden block onto the green block
Unstack the wooden block from the green block
Pour orange juice into the cup
Pick the onion into the basket
Straighten the cup
Pick the apple into the blue bowl
Flick the ball
Pick the yellow toy into the basket
Stack the blue cube on the red cube
Pick the mango into the green plate
Flick the ball
Method | Pick (Google Robot) | Move (Google Robot) | Drawer (Google Robot) | Place (Google Robot) | Avg. (Google Robot) | Carrot (WidowX) | Eggplant (WidowX) | Spoon (WidowX) | Cube (WidowX) | Avg. (WidowX)
---|---|---|---|---|---|---|---|---|---|---
RT-1-X* | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 | 4.2 | 0.0 | 0.0 | 0.0 | 1.1
Octo-base* | 17.0 | 4.2 | 22.7 | 0.0 | 11.0 | 8.3 | 43.1 | 12.5 | 0.0 | 16.0
OpenVLA* | 16.3 | 46.2 | 35.6 | 0.0 | 24.5 | 0.0 | 4.1 | 0.0 | 0.0 | 1.0
RoboVLMs* | 72.7 | 66.3 | 26.8 | 36.1 | 50.5 | 25.0 | 0.0 | 20.8 | 8.3 | 13.5
RoboVLM | 77.3 | 61.7 | 43.5 | 24.1 | 51.7 | 20.8 | 79.2 | 45.8 | 4.2 | 37.5
GR00T | 0.7 | 1.9 | 2.9 | 0.0 | 1.4 | 0.0 | 13.9 | 1.4 | 0.0 | 3.8
MoTo | 74.0 | 60.4 | 43.1 | N/A | N/A | N/A | N/A | N/A | N/A | N/A
LAPA | N/A | N/A | N/A | N/A | N/A | 45.8 | 58.3 | 70.8 | 54.2 | 57.3
Ours w/o latent | 56.3 | 25.8 | 27.3 | 13.9 | 30.8 | 31.3 | 74.6 | 61.7 | 28.3 | 49.0
Ours | 98.7 | 75.0 | 59.3 | 5.6 | 59.6 | 46.3 | 64.6 | 77.9 | 61.3 | 62.5
Method | Spatial | Object | Goal | Long | Average |
---|---|---|---|---|---|
Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
Octo-base | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
Ours w/o latent | 86.0 | 86.5 | 85.0 | 70.0 | 81.9 |
Ours | 97.5 | 97.0 | 91.5 | 74.5 | 90.1 |
Method | Pick in | Pick out | Push | Stack | Unstack | Change block color | Change table cover |
---|---|---|---|---|---|---|---|
GR00T | 30 | 70 | 10 | 10 | 60 | 50 | 30 |
Ours w/o latent | 40 | 80 | 30 | 60 | 70 | 40 | 30 |
Ours | 30 | 100 | 50 | 50 | 100 | 60 | 60 |
Method | Pick & Place (seen) | Pick & Place (unseen) | Stack Cube (seen) | Stack Cube (unseen) | Place Cup Upright (seen) | Place Cup Upright (unseen) | Pour Water (seen) | Pour Water (unseen) | Flick Ball (seen) | Flick Ball (unseen)
---|---|---|---|---|---|---|---|---|---|---
GR1 | 56 | 40 | 15 | 5 | 0 | 0 | 0 | 0 | 40 | 10
GR00T | 44 | 28 | 20 | 0 | 20 | 0 | 0 | 0 | 30 | 0
Ours w/o latent | 72 | 60 | 70 | 40 | 40 | 30 | 40 | 10 | 50 | 30
Ours | 84 | 68 | 75 | 50 | 60 | 30 | 60 | 30 | 50 | 40
@article{chen2025villa0x0,
title = {villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models},
author = {Xiaoyu Chen and Hangxing Wei and Pushi Zhang and Chuheng Zhang and Kaixin Wang and Yanjiang Guo and Rushuai Yang and Yucen Wang and Xinquan Xiao and Li Zhao and Jianyu Chen and Jiang Bian},
year = {2025},
  journal = {arXiv preprint arXiv:2507.23682}
}