Real-World Data for CALVIN

CALVIN evaluates language-conditioned multi-step manipulation in simulation. Real-world data addresses what its PyBullet physics and simplified visuals cannot replicate: the compounding errors of sequential manipulation under real perception and contact dynamics.

CALVIN at a Glance

34 unique tasks
5 max chain length
7-DOF action space
4 scenes (A-D)
30 Hz control frequency
Released 2022

CALVIN Task Suite Overview

CALVIN's 34 tasks span drawer manipulation, block sliding, stacking, button pressing, and LED control. Tasks are grouped by interaction type and can be chained in arbitrary order.

Task Category       | Example Tasks                                    | Observation Modality | Difficulty
Drawer Manipulation | Open drawer, close drawer                        | RGB + proprioception | Medium
Block Sliding       | Push block left/right, slide block to target     | RGB + proprioception | Easy
Block Stacking      | Stack block, unstack block                       | RGB + proprioception | Hard
Lifting & Placing   | Lift colored block, place on slider              | RGB + proprioception | Medium
Switch & LED        | Toggle switch, turn on/off LED, change LED color | RGB + proprioception | Easy
Lever Rotation      | Rotate lever left/right                          | RGB + proprioception | Medium

CALVIN vs. Related Benchmarks

How CALVIN compares to other language-conditioned and multi-step manipulation benchmarks on key dimensions.

Feature               | CALVIN                        | LIBERO                         | Language-Table                | RLBench
Language conditioning | Free-form natural language    | Templated language goals       | Simple verb-noun instructions | Task name only
Sequential evaluation | 1-5 task chains               | 10-step suites (reset between) | Single task                   | Single task
Environment reset     | No reset between chain tasks  | Reset between suite tasks      | Reset per episode             | Reset per episode
Number of tasks       | 34                            | 130                            | ~10 verbs                     | 100
Simulation engine     | PyBullet                      | MuJoCo (robosuite)             | PyBullet                      | CoppeliaSim

Benchmark Profile

CALVIN (Composing Actions from Language and Vision) is a benchmark for evaluating language-conditioned multi-step manipulation. Created by Oier Mees et al. at the University of Freiburg in 2022, it tests whether robots can chain together long sequences of manipulation actions guided by natural language instructions in a simulated tabletop environment built on PyBullet.

Task Set
34 unique manipulation tasks chainable into sequences of 1 to 5 steps. Tasks include sliding blocks, pushing buttons, rotating levers, lifting objects, stacking, toggling switches, opening and closing drawers, and LED color interactions. The benchmark evaluates how many sequential tasks a policy can complete without failure or environment reset.
Observation Space
RGB images from a static third-person camera (200x200) and a wrist-mounted gripper camera (84x84), proprioceptive state comprising 7 joint angles plus gripper width, and natural language task descriptions. A structured scene observation is also available with 3D positions of all interactive objects.
Action Space
7-DOF relative end-effector actions: 3D position delta, 3D orientation delta (Euler angles), and binary gripper open/close. Actions are executed at 30 Hz control frequency on a simulated Franka Panda arm.
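A minimal sketch of this action layout, using plain Python lists; the helper name and argument names are illustrative, not CALVIN's actual API:

```python
# Hypothetical sketch of CALVIN's 7-DOF relative action format.
# Names here are illustrative, not the benchmark's API.
def make_action(dpos, deuler, gripper_open):
    """Pack a relative end-effector action as a length-7 vector.

    dpos:         [dx, dy, dz] position delta
    deuler:       [droll, dpitch, dyaw] orientation delta (Euler angles)
    gripper_open: True -> open (+1.0), False -> close (-1.0)
    """
    assert len(dpos) == 3 and len(deuler) == 3
    return list(dpos) + list(deuler) + [1.0 if gripper_open else -1.0]

action = make_action([0.01, 0.0, -0.02], [0.0, 0.0, 0.05], gripper_open=False)
print(len(action))  # 7
print(action[6])    # -1.0
```

At 30 Hz, a policy emits one such vector per control step; the binary gripper channel is typically thresholded at zero.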
Evaluation Protocol
Average length of successfully completed task chains across 1,000 evaluation sequences. Each sequence requests up to 5 tasks in order; the policy scores higher by completing longer unbroken chains. The environment comprises four scenes (A-D) with different object configurations; in the standard ABC→D split, policies train on scenes A-C and are evaluated in the held-out scene D to test generalization. The primary metric is the average number of tasks completed in a row (0-5 scale).
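The chain-scoring rule can be sketched in a few lines of Python (a minimal illustration; the function name and data layout are mine, not from the CALVIN codebase):

```python
def average_chain_length(results):
    """Score CALVIN-style sequential evaluation.

    results: one list of booleans per evaluation sequence, in task order.
    Scoring counts consecutive successes and stops at the first failure.
    Returns the mean number of tasks completed in a row (0-5).
    """
    total = 0
    for seq in results:
        for ok in seq:
            if not ok:
                break  # a failure ends the chain; later tasks don't count
            total += 1
    return total / len(results)

# One sequence fails on task 4 (3 completed), one completes all 5.
print(average_chain_length([[True, True, True, False, False],
                            [True, True, True, True, True]]))  # 4.0
```

Note that successes after the first failure do not count; the metric rewards unbroken chains, which is what makes it sensitive to compounding errors.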

The Sim-to-Real Gap

CALVIN's PyBullet simulation uses simplified contact models where objects snap into stable configurations. Real-world sequential manipulation requires recovering from compounding errors across steps — a small positioning error in step 1 cascades through steps 2-5. CALVIN also uses uniform lighting and simple textures, lacking the visual complexity of real kitchens and workspaces. The reset-free evaluation partially captures real chaining dynamics but misses actuator drift, object state estimation errors, and the physical fatigue effects of extended manipulation sequences.

Real-World Data Needed

Long-horizon manipulation recordings with natural language annotations, showing multi-step task completion in real environments. Critical needs include demonstrations that capture compounding errors and recovery from them, authentic visual complexity with clutter and varying lighting, diverse language instructions for the same task sequences, and coverage of multiple environment layouts to match CALVIN's multi-scene evaluation protocol.

Complementary Claru Datasets

Egocentric Activity Dataset

Human activity video shows long-horizon task completion with natural recovery from errors — the real-world analog of CALVIN's chained task evaluation. Captured across 100+ locations with naturally varying visual conditions.

Manipulation Trajectory Dataset

Real-world manipulation with temporal annotations provides authentic multi-step task data for training policies that handle compounding errors across sequential manipulation.

Custom Language-Paired Collection

Purpose-collected demonstrations with concurrent natural language narration provide the language-action grounding that CALVIN's evaluation protocol specifically measures.

Bridging the Gap: Technical Analysis

CALVIN addresses a critical limitation of single-task benchmarks: real robots must chain tasks together, and errors from one task affect the next. A policy that can open a drawer with 90% success and place an object with 85% success has only ~76% success at the two-step chain, and ~51% at a five-step chain. This compounding error problem makes long-horizon manipulation fundamentally harder than single tasks.
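The compounding arithmetic above is just a product of per-task success rates (assuming independence). A quick Python check; the ~51% five-step figure corresponds to roughly 87.5% per-task success, an assumption on my part since the text does not specify the per-step rates:

```python
# Worked check of the compounding-error arithmetic from the text.
def chain_success(per_task_rates):
    """Probability of completing every task in a chain,
    assuming independent per-task success rates."""
    p = 1.0
    for rate in per_task_rates:
        p *= rate
    return p

print(round(chain_success([0.90, 0.85]), 3))  # 0.765 (two-step chain)
print(round(chain_success([0.875] * 5), 3))   # 0.513 (five steps at ~87.5% each)
```

In practice the independence assumption is optimistic: an early error shifts object states, so later per-task rates degrade rather than stay fixed.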

The sim-to-real challenge for CALVIN is compounded by this sequential structure. Real-world object states change unpredictably after each manipulation step — objects shift, rotate, or partially fall. The robot must perceive these state changes accurately and adapt subsequent actions. In CALVIN's simulation, object states are precisely known; in reality, state estimation errors add another source of compounding failure.

CALVIN's multi-environment design (scenes A through D) is intended to test visual generalization. However, the visual variation between simulated scenes is minimal compared to the gap between any simulated scene and a real-world kitchen. Models that generalize across CALVIN scenes may still fail catastrophically when confronted with real textures, reflections, and lighting.

The language conditioning component adds further complexity. The instruction 'put the red block in the drawer' has many valid execution strategies depending on drawer state, block position, and surrounding clutter. In simulation, language-to-action grounding benefits from simplified perception. In reality, the language grounding must handle ambiguity, partial occlusion, and objects unseen during training.

Real-world language-conditioned manipulation data addresses these gaps directly. Human demonstrations of multi-step kitchen tasks, for example, naturally include the kind of error recovery and adaptation that CALVIN evaluates. A human making a sandwich handles bread that tears, ingredients that shift, and tools that slip — exactly the robustness that CALVIN's sequential evaluation demands. Claru's egocentric activity dataset captures these interactions authentically across diverse environments.

Key Papers

  1. Mees et al. "CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks." RA-L, 2022.
  2. Mees et al. "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data." RA-L, 2022.
  3. Shi et al. "SUSIE: Subgoal Synthesis via Image Editing for Language-Conditioned Control." CoRL, 2024.
  4. Mees et al. "Grounding Language with Visual Affordances over Unstructured Data." ICRA, 2023.
  5. Ha et al. "Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition." CoRL, 2023.

Frequently Asked Questions

Why don't high CALVIN scores predict real-world reliability?

When tasks are chained sequentially without environment reset, each task's success rate multiplies. A 90% single-task rate becomes ~59% (0.9^5) over 5 sequential tasks. Real-world conditions worsen this because object states change unpredictably after each step: a block shifted during step 1 is no longer where the policy expects it for step 3. This compounding dynamic is CALVIN's core evaluation insight and the primary reason high simulation scores do not predict deployment reliability.

Why does language conditioning complicate sim-to-real transfer?

Language adds ambiguity that simulation sidesteps. The instruction 'put the red block in the drawer' has many valid execution strategies depending on drawer state, block orientation, and surrounding objects. In CALVIN's simulation, object identities and positions are perfectly known. In the real world, language grounding must handle visual ambiguity, partial occlusion, objects not seen during training, and spatial references that depend on viewpoint.

How does real-world demonstration data help?

Real-world multi-step demonstrations show natural error recovery: adjusting grip when objects shift, re-approaching when initial grasps fail, adapting plans when task preconditions change. This recovery behavior is absent from simulation demonstrations, where grasps either succeed or the episode ends. Training on real recoveries produces more robust sequential policies that maintain longer CALVIN-style task chains on physical hardware.

How does CALVIN test visual generalization?

CALVIN provides four distinct scenes with different table textures, object placements, and background colors. In the standard ABC→D split, scenes A through C are used for training and the held-out scene D tests visual generalization (the easier ABCD→D split trains on all four). The visual variation between scenes is controlled (same objects, different arrangements), making the sim-to-real gap much larger than the inter-scene gap.

How much training data does CALVIN provide?

CALVIN provides over 24 hours of teleoperated play data collected across its four scenes at 30 Hz, with roughly 1% of the data paired with crowd-sourced language annotations. The data includes both task-directed demonstrations and exploratory play, enabling research on learning from unstructured interaction data, a pattern increasingly relevant for real-world robot learning.

Get Multi-Step Manipulation Data

Discuss language-paired, sequential manipulation data that parallels CALVIN's evaluation framework for sim-to-real transfer.