Real-World Data for VLABench

VLABench tests compositional language understanding for robot manipulation. Real-world data adds the visual ambiguity that simulation lacks.

VLABench at a Glance

- Tasks: 100+
- Language type: Compositional
- Model target: VLA
- Reasoning: Long-horizon
- Released: 2024

Language Understanding Axes

VLABench systematically tests different dimensions of language understanding for manipulation.

| Language Dimension | Example Instruction | What It Tests |
|---|---|---|
| Spatial Relations | 'Put the block to the left of the cup' | Spatial reasoning relative to reference objects |
| Color/Shape Grounding | 'Pick up the red cylinder' | Visual attribute binding to language descriptions |
| Comparative Relations | 'Move the bigger block closer to you' | Relative attribute comparison and spatial reference |
| Compositional Instructions | 'Put the red block left of the blue one and on top of the green one' | Combining multiple relations in a single instruction |
| Long-Horizon Reasoning | 'First clear the table, then arrange the blocks by size' | Multi-step planning from language specification |
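
To make the compositional axis concrete, the sketch below generates instructions of the single-relation and chained-relation forms from templates. The vocabularies and function names are illustrative assumptions, not VLABench's actual instruction generator.

```python
import random

# Hypothetical vocabularies for illustration; VLABench's actual object and
# relation sets differ.
OBJECTS = ["red block", "blue cylinder", "green cup", "yellow bowl"]
RELATIONS = ["to the left of", "to the right of", "on top of", "inside"]

def single_relation(rng: random.Random) -> str:
    """One-relation instruction, e.g. 'put the red block on top of the green cup'."""
    target, reference = rng.sample(OBJECTS, 2)
    return f"put the {target} {rng.choice(RELATIONS)} the {reference}"

def compositional(rng: random.Random) -> str:
    """Two relations chained on one target: the pattern the compositional axis tests."""
    target, ref_a, ref_b = rng.sample(OBJECTS, 3)
    rel_a, rel_b = rng.sample(RELATIONS, 2)
    return f"put the {target} {rel_a} the {ref_a} and {rel_b} the {ref_b}"

rng = random.Random(0)
print(single_relation(rng))   # one spatial relation
print(compositional(rng))     # held-out combinations of these test generalization
```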

VLABench vs. Related Language-Conditioned Benchmarks

| Feature | VLABench | CALVIN | Language-Table | CLIPort |
|---|---|---|---|---|
| Language complexity | Compositional, multi-relation | Free-form natural language | Simple verb-noun | Template-based |
| Compositional test | Systematic held-out compositions | No explicit composition test | No | No |
| Spatial reasoning | Primary focus | Implicit in tasks | Minimal | Position specification |
| Long-horizon | Multi-step instructions | 5-step chains | Single step | Single step |

Benchmark Profile

VLABench evaluates vision-language-action models on their ability to ground natural language instructions in physical manipulation. It tests VLA models on compositional language understanding — can the model correctly interpret 'put the red block to the left of the blue cylinder' when objects and spatial relations vary?

Task Set
100+ language-conditioned manipulation tasks testing spatial reasoning (left/right/on top/inside), color and shape grounding, comparative relations (bigger, closer), and multi-step instruction following with compositional language.
Observation Space
RGB images from static and wrist cameras, depth maps, proprioceptive state, and natural language instructions with varying complexity.
Action Space
End-effector delta poses with binary gripper control.
Evaluation Protocol
Language-grounded manipulation success rate across held-out language templates, novel object combinations, and unseen spatial configurations. Tests compositional generalization to instructions not seen during training.
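
The sketch below shows one way the profile above could map onto code: containers matching the observation and action spaces described, and a success-rate loop over held-out instructions. The `Observation`, `Action`, `env`, and `policy` interfaces are assumptions for illustration, not VLABench's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One timestep of the observation space described above."""
    static_rgb: np.ndarray   # (H, W, 3) uint8, static camera
    wrist_rgb: np.ndarray    # (H, W, 3) uint8, wrist camera
    depth: np.ndarray        # (H, W) float32 depth map
    proprio: np.ndarray      # proprioceptive state (e.g. joint positions)
    instruction: str         # natural language instruction

@dataclass
class Action:
    """End-effector delta pose with binary gripper, per the action space above."""
    delta_pose: np.ndarray   # (6,): xyz translation plus rotation deltas
    gripper_open: bool

def success_rate(policy, env, episodes: int) -> float:
    """Fraction of episodes solved on held-out templates and object combinations."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()            # assumed to sample an unseen instruction
        done, success = False, False
        while not done:
            obs, done, success = env.step(policy.act(obs))
        successes += int(success)
    return successes / episodes
```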

The Sim-to-Real Gap

VLABench evaluates language understanding in simulation, where object identification is clean and unambiguous. Real-world language grounding must handle visual ambiguity, partial occlusion, distractors, and objects that do not exactly match their language descriptions. The benchmark's rendered scenes also lack the photorealistic clutter of real environments.

Real-World Data Needed

Language-paired manipulation data in real environments where objects are visually ambiguous, partially occluded, or described imprecisely. Compositional instruction data where spatial relations reference real-world landmarks. Diverse object-language grounding data across many environments and language styles.

Complementary Claru Datasets

Custom Language-Paired Collection

Purpose-collected manipulation demonstrations with concurrent compositional language descriptions provide the real-world language-action grounding VLABench evaluates.

Egocentric Activity Dataset

Real-world activity video provides visual pretraining data with authentic object appearances and environmental context for language grounding.

Manipulation Trajectory Dataset

Diverse manipulation recordings provide the visual foundation for training robust object and spatial relation recognition.

Bridging the Gap: Technical Analysis

VLABench addresses a critical gap in VLA model evaluation: compositional language understanding. Most VLA benchmarks use simple instructions like 'pick up the red block.' VLABench tests whether models understand compositional spatial relations, comparative adjectives, and multi-step instructions.

The compositional generalization test is particularly revealing. A model that learns 'put X left of Y' and 'put X on top of Z' should be able to execute 'put X left of Y and on top of Z' without explicit training on that combination. Real-world instructions are naturally compositional, making this evaluation critical for deployed robots.
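
A minimal sketch of how such a held-out composition split can be constructed (the general recipe, not VLABench's exact split): enumerate relation pairs, reserve some pairings for evaluation, and verify every individual relation still appears in training.

```python
import itertools

RELATIONS = ["left of", "right of", "on top of", "inside"]

# Every ordered pair of relations a two-clause instruction could combine.
all_pairs = list(itertools.permutations(RELATIONS, 2))

# Hold out specific pairings: each relation is seen alone during training,
# but these combinations appear only at evaluation time.
held_out = {("left of", "on top of"), ("inside", "right of")}

train_pairs = [p for p in all_pairs if p not in held_out]
test_pairs = [p for p in all_pairs if p in held_out]

# Sanity check: the split tests composition, not unseen vocabulary.
assert {r for p in train_pairs for r in p} == set(RELATIONS)
print(f"{len(train_pairs)} train combinations, {len(test_pairs)} held out")
```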

However, VLABench's simulation provides clean visual scenes where objects are unambiguously identifiable. Real-world language grounding is harder because objects may be partially occluded, visually similar to distractors, or described imprecisely ('the thing next to the cup'). Real-world language-paired data must capture this ambiguity.

Claru can collect manipulation demonstrations with concurrent compositional language narration in real environments, producing data where language descriptions must be grounded in visually complex scenes with authentic ambiguity.
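
As an illustration, a single record in such a dataset might pair one observation frame, one action, and the concurrent narration, with tags marking the real-world ambiguity the scene contains. The schema below is hypothetical, not Claru's actual format.

```python
import json

# Hypothetical record for one narrated demonstration step; all field names
# and paths are illustrative.
record = {
    "episode_id": "kitchen_0042",
    "instruction": "put the mug to the left of the toaster, behind the cutting board",
    "observation": {
        "rgb_path": "frames/kitchen_0042/000153.jpg",
        "depth_path": "frames/kitchen_0042/000153.npy",
        "proprio": [0.12, -0.45, 0.33, 1.57, 0.0, 0.0, 0.04],
    },
    "action": {
        "delta_pose": [0.01, -0.02, 0.0, 0.0, 0.0, 0.05],
        "gripper_open": False,
    },
    # Annotations that only real scenes produce: the ambiguity simulation lacks.
    "ambiguity_tags": ["reference_partially_occluded", "two_mugs_in_scene"],
}
print(json.dumps(record, indent=2))
```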

Key Papers

  1. Zheng et al. "VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning." arXiv:2412.18194, 2024.
  2. Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023.
  3. Shridhar et al. "CLIPort: What and Where Pathways for Robotic Manipulation." CoRL 2021.

Frequently Asked Questions

What does compositional language mean?
Compositional language means combining known concepts into new instructions. If a robot knows 'left of' and 'on top of,' it should understand 'left of X and on top of Y' without explicit training. VLABench tests this compositional generalization systematically.

Why is real-world language grounding harder than in simulation?
Simulation provides clean scenes where objects are unambiguously identifiable by color and shape. Real-world language grounding must handle partial occlusion, visual similarity between objects, imprecise language ('the thing by the cup'), and environmental distractors that simulation scenes lack.

How can Claru data complement VLABench?
Claru can collect manipulation demonstrations with concurrent natural language narration in diverse real environments. This produces language-action pairs where grounding must handle authentic visual complexity: exactly what VLABench evaluates, but in real-world conditions.

Why does compositional generalization matter?
Compositional generalization means combining known concepts into novel combinations. If a robot learns 'left of' and 'behind' separately, it should handle 'left of X and behind Y' without explicit training. Real human instructions are naturally compositional, so this capability is essential for robots that must follow verbal commands in deployment.

What kinds of ambiguity does simulation fail to test?
In simulation, objects have distinct colors and shapes that unambiguously match language descriptions. Real environments contain visually similar objects, partial occlusions, and lighting-dependent appearances. The instruction 'pick up the red thing' might match multiple objects, requiring context-dependent disambiguation that simulation does not test.

Get Language-Paired Manipulation Data

Discuss compositional language-action data for VLA model training and evaluation.