Magma pretraining pipeline. For all training data, text is tokenized, while images and videos from different domains are encoded by a shared vision encoder. The resulting discrete and continuous tokens are then fed into an LLM to generate outputs of verbal, spatial, and action types.
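The data flow in this caption can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_tokenize`, `toy_vision_encoder`, and `build_input` are hypothetical stand-ins (a whitespace tokenizer and mean pooling over pixel patches), and placing vision tokens before text tokens is an arbitrary choice, not something the caption specifies. The only point is that one shared encoder turns every frame into continuous features, which are combined with discrete text token ids into a single sequence for the LLM.

```python
from typing import Dict, List, Tuple, Union

Frame = List[List[float]]  # toy grayscale image: rows of pixel intensities
Token = Tuple[str, Union[int, List[float]]]  # ("text", id) or ("vision", feature)


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer; unseen words receive the next free id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def toy_vision_encoder(frame: Frame, patch: int = 2) -> List[List[float]]:
    """Hypothetical encoder: mean-pool each patch x patch block into a 1-D feature."""
    features = []
    height, width = len(frame), len(frame[0])
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            block = [
                frame[i][j]
                for i in range(r, min(r + patch, height))
                for j in range(c, min(c + patch, width))
            ]
            features.append([sum(block) / len(block)])
    return features


def build_input(frames: List[Frame], text: str, vocab: Dict[str, int]) -> List[Token]:
    """Combine continuous vision tokens and discrete text tokens into one sequence."""
    sequence: List[Token] = []
    for frame in frames:
        # The same encoder is used for every frame, whatever its source domain.
        sequence += [("vision", feat) for feat in toy_vision_encoder(frame)]
    # Assumed ordering: vision tokens first, then text tokens.
    sequence += [("text", tok) for tok in toy_tokenize(text, vocab)]
    return sequence


frame = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
vocab: Dict[str, int] = {}
sequence = build_input([frame, frame], "Pick up the chip bag", vocab)
print(len(sequence))  # 8 vision tokens + 5 text tokens = 13
```

A real system would project the vision features into the LLM's embedding space and look up embeddings for the text ids; that step is omitted here.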
Set-of-Mark (SoM) for Action Grounding. Set-of-Mark prompting enables effective action grounding in images for UI screenshots (left), robot manipulation (middle), and human videos (right) by having the model predict numeric marks for clickable buttons or robot arms in image space.
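A minimal sketch of the idea behind mark-based grounding, under stated assumptions: candidate regions are given as pixel bounding boxes, each receives a numeric mark at its center, and a predicted mark number is mapped back to image coordinates. The helper names `assign_marks` and `ground_action`, the box format, and the ordering rule are hypothetical and are not taken from the Magma implementation.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]


def assign_marks(boxes: List[Box]) -> Dict[int, Point]:
    """Give each candidate region a numeric mark (1, 2, ...) located at its center."""
    # Sort top-to-bottom, then left-to-right, so numbering is deterministic.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {
        mark: ((x0 + x1) / 2, (y0 + y1) / 2)
        for mark, (x0, y0, x1, y1) in enumerate(ordered, start=1)
    }


def ground_action(predicted_mark: int, marks: Dict[int, Point]) -> Point:
    """Map the mark number predicted by the model back to image coordinates."""
    if predicted_mark not in marks:
        raise ValueError(f"unknown mark: {predicted_mark}")
    return marks[predicted_mark]


# Three hypothetical clickable regions in a UI screenshot.
boxes = [(200, 40, 260, 80), (20, 40, 120, 80), (20, 200, 120, 260)]
marks = assign_marks(boxes)
click_xy = ground_action(2, marks)
print(click_xy)  # (230.0, 60.0)
```

With this framing the model chooses among a small set of integers instead of regressing raw pixel coordinates, and the same interface applies whether the marked region is a button or a robot gripper.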
Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervision for robot manipulation (left) and human action (right). It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions.
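The supervision target described here can be sketched as follows, assuming each mark already has a tracked pixel position in every frame. The helpers `trace_targets`, `format_trace`, `trace_cost`, and `frame_cost`, the text serialization, and the figure of 256 patches per frame are all hypothetical choices for illustration; actual token counts depend on the tokenizer and vision encoder used.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def trace_targets(tracks: Dict[int, List[Point]], t: int, horizon: int) -> Dict[int, List[Point]]:
    """For each mark, collect its positions over the `horizon` frames after frame t."""
    return {mark: pts[t + 1 : t + 1 + horizon] for mark, pts in tracks.items()}


def format_trace(targets: Dict[int, List[Point]]) -> str:
    """Serialize traces as text a model could be trained to emit (format is assumed)."""
    parts = []
    for mark in sorted(targets):
        coords = ", ".join(f"[{x},{y}]" for x, y in targets[mark])
        parts.append(f"Mark {mark}: [{coords}]")
    return "; ".join(parts)


def trace_cost(num_marks: int, horizon: int) -> int:
    """Coordinate values needed to describe the future as traces (x and y per step)."""
    return num_marks * horizon * 2


def frame_cost(horizon: int, patches_per_frame: int) -> int:
    """Patch tokens needed if every future frame were predicted in full."""
    return horizon * patches_per_frame


# Hypothetical tracks for two marks over four frames.
tracks = {
    1: [(10, 10), (12, 11), (15, 13), (19, 16)],
    2: [(50, 40), (50, 40), (49, 41), (47, 43)],
}
targets = trace_targets(tracks, t=0, horizon=3)
print(format_trace(targets))
# Mark 1: [[12,11], [15,13], [19,16]]; Mark 2: [[50,40], [49,41], [47,43]]
print(trace_cost(2, 3), frame_cost(3, 256))  # 12 768
```

The rough comparison in the last line shows why traces scale to longer horizons: the target grows with the number of marks rather than with image resolution, and background pixels never enter the prediction target.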
Overview of Pretraining Data Sources: instructional videos, robotics manipulation, UI navigation, and multimodal understanding. We apply SoM and ToM to different data types, with SoM enabling unified action grounding across all modalities, while ToM is applied specifically to video and robotics data.
Magma plans the future trajectories for robot manipulation tasks.
Task: Pick up the chip bag.
Task: Push chip bag to left edge of table.
Task: Pick up the coke can.
Ask multi-modal models to PLAY a 2D game: collect the green blocks by automatically moving up, down, left, and right.
Magma Agent vs. LLaVA-OneVision Agent
Magma Agent vs. Qwen2-VL Agent
Magma Agent vs. GPT-4o mini Agent
Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. Magma is the only model that can perform the full spectrum of tasks.
What's weather in Seattle? & turn on flight mode
Share and message this to Bob Steve, click send button to complete
Efficient finetuning on Mind2Web for web UI navigation.
Efficient finetuning on AITW for mobile UI navigation.
WidowX videos show real robot rollouts with the Magma model. (Videos are sped up by 2.5x.)
Magma
OpenVLA
Pick Place Hotdog Sausage
Put Mushroom Place Pot
Push Cloth Left to Right (Out-of-Dist.)
Few-shot finetuning on WidowX robot. Magma consistently outperforms OpenVLA (finetuning) across soft object manipulation and pick-and-place operations, demonstrating reliable performance in both in-distribution and out-of-distribution generalization tasks on real robots.
Magma on Real Robot
Magma on LIBERO
Few-shot finetuning on LIBERO. Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.
Zero-shot SimplerEnv evaluation on Google Robots and Bridge. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.
Spatial reasoning evaluations. Spatial reasoning questions are challenging even for GPT-4o, but Magma answers them relatively well despite relying on far less pretraining data.
Zero-shot Video QA benchmarks. Magma performs competitively and even outperforms some state-of-the-art approaches, such as Video-Llama2 and ShareGPT4Video, on most benchmarks, despite using far less video instruction-tuning data.
User: I really want to win this game. Can you suggest the next move that I should make to maximize my chances of winning?
Magma: I suggest moving the pawn to d4 to maximize your chances of winning.
User: I want to relax for a few hours in here. Can you suggest some things for me to do?
Magma: Watch TV, read a book, and enjoy the view of the autumn forest.
User: Please provide a detailed description of the video, focusing on the main subject and the actions taking place as well as predict what is going to happen next.
Magma: The video takes place in a convenience store, focusing on the checkout counter. It begins with a view of the counter displaying a sign that reads 'PICK 5 FOR $8.00' along with images of various snacks. The counter is filled with various items, including snacks and a payment terminal. An employee, wearing a black shirt with white sleeves, is seen handling the payment terminal. A customer, identifiable by their white sleeve and colorful bracelet, places a red shopping basket filled with snacks and a drink on the counter. The employee continues to operate the payment terminal while the customer adjusts the items in the basket. The interaction between the customer and the employee is highlighted, with the employee assisting the customer with the items. The scene continues at the same checkout counter, with the 'PICK 5 FOR $8.00' sign still prominently displayed. The customer's red shopping basket, now filled with snacks including a drink cup labeled 'Fruit Swoosh' and a blue package of cookies, is placed on the counter. The employee is seen handling the payment terminal. The customer's hand, adorned with a colorful bracelet, interacts with the items in the basket. The employee continues to assist the customer, adjusting the items on the counter. The interaction between the customer and the employee remains the focal point, with the employee assisting the customer with the items. The person behind the counter is likely to process the payment and possibly hand over the items.
User: Please describe in detail what is happening in the video by focusing on the main actions.
Magma: The video begins with a close-up of a hand reaching for a red potted plant placed on a wooden shelf. The plant has green and yellow-striped leaves and is accompanied by a book titled 'YOUNG' and a wooden bowl stacked on top of each other. The hand carefully adjusts the position of the plant on the shelf, ensuring it is properly placed. The background includes a beige watering can hanging on a stand and a radiator with white pipes. The hand then adjusts the plant once more, making sure it is centered on the shelf. The scene develops as the hand, now wearing a yellow sweater and a ring on the left hand, reaches out to adjust the plant's position once more. The hand then moves away from the plant, and the camera shifts to show more of the surrounding area, including a red high chair and a wooden countertop with various items, such as a smartphone and a small red cup. The video wraps up with a wider view of the room, revealing a checkered floor and several other plants in the background, adding a homely touch to the setting.
User: Please describe in detail what is happening in the video by focusing on the main actions.
Magma: The video begins with a close-up view of a kitchen sink area. The sink has a shiny silver faucet with water running, and a person is holding a black mug under the running water. On the left side of the sink, there is a bottle of dish soap and a blue toothbrush. The person then picks up a blue sponge and starts scrubbing the inside of the mug. The scene continues with the person methodically scrubbing the mug, ensuring it is clean. The video wraps up with the person placing the scrubbed mug in the sink and the blue sponge beside it, indicating that the cleaning process is complete.
User: What drink did the person choose in the video?
Magma: The person chose a soy beverage labeled as 'SAVE SOY'.
User: Can you predict the next actions that the person is going to carry out beyond the video?
Magma: The person is likely about to pour hot water into the cup and proceed with the tea preparation process.
User: Can you describe what the person is doing in the video?
Magma: The person is interacting with the coffee machine by pressing buttons, likely to make a coffee choice.
@misc{yang2025magmafoundationmodelmultimodal,
title={Magma: A Foundation Model for Multimodal AI Agents},
author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
year={2025},
eprint={2502.13130},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.13130},
}