Real-World Data for VLABench
VLABench tests compositional language understanding for robot manipulation. Real-world data adds the visual ambiguity that simulation lacks.
VLABench at a Glance
Language Understanding Axes
VLABench systematically tests different dimensions of language understanding for manipulation.
| Language Dimension | Example Instruction | What It Tests |
|---|---|---|
| Spatial Relations | Put the block to the left of the cup | Spatial reasoning relative to reference objects |
| Color/Shape Grounding | Pick up the red cylinder | Visual attribute binding to language descriptions |
| Comparative Relations | Move the bigger block closer to you | Relative attribute comparison and spatial reference |
| Compositional Instructions | Put the red block left of the blue one and on top of the green one | Combining multiple relations in a single instruction |
| Long-Horizon Reasoning | First clear the table, then arrange the blocks by size | Multi-step planning from language specification |
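A minimal sketch of how instructions along these dimensions can be instantiated from templates; the vocabulary and phrasing below are illustrative, not VLABench's actual instruction generator:

```python
# Illustrative templates for the language dimensions above; examples only,
# not VLABench's actual instruction-generation code.
import random

OBJECTS = ["red cylinder", "blue block", "green bowl"]
RELATIONS = ["left of", "right of", "behind"]

TEMPLATES = {
    "spatial":       "put the {a} {rel} the {b}",
    "comparative":   "move the bigger {a_noun} closer to you",
    "compositional": "put the {a} {rel} the {b} and on top of the {c}",
    "long_horizon":  "first clear the table, then arrange the blocks by size",
}

def sample_instruction(dimension: str) -> str:
    # Sample distinct objects and a relation, then fill the chosen template.
    a, b, c = random.sample(OBJECTS, 3)
    return TEMPLATES[dimension].format(
        a=a, b=b, c=c,
        rel=random.choice(RELATIONS),
        a_noun=a.split()[-1],
    )

for dim in TEMPLATES:
    print(dim, "->", sample_instruction(dim))
```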
VLABench vs. Related Language-Conditioned Benchmarks
| Feature | VLABench | CALVIN | Language-Table | CLIPort |
|---|---|---|---|---|
| Language complexity | Compositional, multi-relation | Free-form natural language | Simple verb-noun | Template-based |
| Compositional test | Systematic held-out compositions | No explicit composition test | No | No |
| Spatial reasoning | Primary focus | Implicit in tasks | Minimal | Position specification |
| Long-horizon | Multi-step instructions | 5-step chains | Single step | Single step |
Benchmark Profile
VLABench evaluates vision-language-action models on their ability to ground natural language instructions in physical manipulation. It tests VLA models on compositional language understanding — can the model correctly interpret 'put the red block to the left of the blue cylinder' when objects and spatial relations vary?
The Sim-to-Real Gap
VLABench evaluates language understanding in simulation, where object identification is clean and unambiguous. Real-world language grounding must handle visual ambiguity, partial occlusion, distractors, and objects that do not exactly match their language descriptions. The benchmark's simulated scenes also lack photorealistic clutter.
Real-World Data Needed
- Language-paired manipulation data in real environments where objects are visually ambiguous, partially occluded, or described imprecisely.
- Compositional instruction data where spatial relations reference real-world landmarks.
- Diverse object-language grounding data across many environments and language styles.
Complementary Claru Datasets
Custom Language-Paired Collection
Purpose-collected manipulation demonstrations with concurrent compositional language descriptions provide the real-world language-action grounding VLABench evaluates.
Egocentric Activity Dataset
Real-world activity video provides visual pretraining data with authentic object appearances and environmental context for language grounding.
Manipulation Trajectory Dataset
Diverse manipulation recordings provide the visual foundation for training robust object and spatial relation recognition.
Bridging the Gap: Technical Analysis
VLABench addresses a critical gap in VLA model evaluation: compositional language understanding. Most VLA benchmarks use simple instructions like 'pick up the red block.' VLABench tests whether models understand compositional spatial relations, comparative adjectives, and multi-step instructions.
The compositional generalization test is particularly revealing. A model that learns 'put X left of Y' and 'put X on top of Z' should be able to execute 'put X left of Y and on top of Z' without explicit training on that combination. Real-world instructions are naturally compositional, making this evaluation critical for deployed robots.
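A minimal sketch of what a held-out composition split can look like, assuming template-generated instructions; this is illustrative, not VLABench's published evaluation protocol:

```python
# Sketch of a held-out composition split (illustrative, not VLABench's actual
# protocol): single-relation instructions are available for training, while
# specific relation combinations appear only at evaluation time.
from itertools import permutations, product

OBJECTS = ["red block", "blue cylinder", "green bowl"]
RELATIONS = ["left of", "right of", "on top of", "behind"]

def single_relation_instructions():
    # Training-style instructions: one spatial relation each.
    for rel, (a, b) in product(RELATIONS, permutations(OBJECTS, 2)):
        yield f"put the {a} {rel} the {b}"

def held_out_compositions(held_out_pairs):
    # Evaluation-style instructions combining two relations the model
    # has only ever seen in isolation during training.
    for (r1, r2), (a, b, c) in product(held_out_pairs, permutations(OBJECTS, 3)):
        yield f"put the {a} {r1} the {b} and {r2} the {c}"

train = list(single_relation_instructions())
test = list(held_out_compositions([("left of", "on top of")]))
print(len(train), "training instructions,", len(test), "held-out compositions")
```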
However, VLABench's simulation provides clean visual scenes where objects are unambiguously identifiable. Real-world language grounding is harder because objects may be partially occluded, visually similar to distractors, or described imprecisely ('the thing next to the cup'). Real-world language-paired data must capture this ambiguity.
Claru can collect manipulation demonstrations with concurrent compositional language narration in real environments, producing data where language descriptions must be grounded in visually complex scenes with authentic ambiguity.
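A sketch of what a record in such a collection could look like; the field names and structure below are hypothetical, not a published Claru or VLABench schema:

```python
# Hypothetical record layout for a real-world language-paired manipulation
# episode. Field names are illustrative, not a published data format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectAnnotation:
    label: str            # free-form label, e.g. "red mug"
    bbox: List[float]     # [x_min, y_min, x_max, y_max] in image pixels
    occluded: bool        # partially hidden by another object
    distractor: bool      # visually similar to the referent but not the target

@dataclass
class LanguagePairedEpisode:
    instruction: str                 # compositional narration recorded with the demo
    rgb_frames: List[str]            # paths to time-synchronized camera frames
    actions: List[List[float]]       # per-step end-effector or joint commands
    objects: List[ObjectAnnotation] = field(default_factory=list)

episode = LanguagePairedEpisode(
    instruction="put the red mug left of the kettle",
    rgb_frames=["frames/000.jpg", "frames/001.jpg"],
    actions=[[0.42, -0.10, 0.25, 0.0, 0.0, 0.0, 1.0]],
    objects=[ObjectAnnotation("red mug", [120.0, 88.0, 210.0, 190.0],
                              occluded=True, distractor=False)],
)
```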
Key Papers
- [1] Zheng et al. "VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning." arXiv:2412.18194, 2024.
- [2] Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023.
- [3] Shridhar et al. "CLIPort: What and Where Pathways for Robotic Manipulation." CoRL 2021.
Frequently Asked Questions
What does 'compositional language' mean in VLABench?
Compositional language means combining known concepts into new instructions. If a robot knows 'left of' and 'on top of,' it should understand 'left of X and on top of Y' without explicit training. VLABench tests this compositional generalization systematically.
Why is real-world language grounding harder than in simulation?
Simulation provides clean scenes where objects are unambiguously identifiable by color and shape. Real-world language grounding must handle partial occlusion, visual similarity between objects, imprecise language ('the thing by the cup'), and environmental distractors that simulation scenes lack.
How can Claru data complement VLABench?
Claru can collect manipulation demonstrations with concurrent natural language narration in diverse real environments. This produces language-action pairs where grounding must handle authentic visual complexity — exactly what VLABench evaluates but in real-world conditions.
Why does compositional generalization matter for deployed robots?
Compositional generalization means combining known concepts into novel combinations. If a robot learns 'left of' and 'behind' separately, it should handle 'left of X and behind Y' without explicit training. Real human instructions are naturally compositional, so this capability is essential for robots that must follow verbal commands in deployment.
What makes object reference ambiguous in real environments?
In simulation, objects have distinct colors and shapes that unambiguously match language descriptions. Real environments contain visually similar objects, partial occlusions, and lighting-dependent appearances. The instruction 'pick up the red thing' might match multiple objects, requiring context-dependent disambiguation that simulation does not test.
Get Language-Paired Manipulation Data
Discuss compositional language-action data for VLA model training and evaluation.