Robot Foundation Models (RFMs), also referred to as Vision-Language-Action (VLA) models, have been attracting the attention of researchers and practitioners with the promise of generalizing robot behaviors across tasks, objects, and environments. The community has extensively studied RFMs' generalization capabilities in the vision and language space. However, affordance generalization – RFMs' ability to manipulate new objects with familiar physical features – remains largely unexplored.
Meanwhile, this meta-skill plays a critical role in a person's ability to quickly figure out how to handle hitherto unseen objects. In fact, basic physical interface elements such as buttons and switches are designed to look and function similarly across different devices precisely to facilitate affordance generalization in environments inhabited by people. Whether robots can capitalize on these design aids remains unknown: researchers currently lack a benchmark for systematically studying affordance generalization in RFMs.
BusyBox is a physical, 3D-printable kit for systematically evaluating how well RFMs generalize their knowledge of basic affordances (pressing buttons, flipping switches, turning knobs, etc.). BusyBox can be assembled into any of a multitude of distinct objects sharing the same set of affordances. Paired with the carefully designed experiment and data-collection protocols we present in this work, BusyBox can provide valuable insights into RFMs' ability to recognize and exploit ubiquitous affordance classes. In our experiments, BusyBox highlights affordance generalization as a major improvement area for RFMs.