CALVIN Alternative: Real-World Training Data for Production Robotics

CALVIN pioneered language-conditioned, long-horizon manipulation benchmarking with 34 tasks across 4 simulated environments. But simulation-only play data cannot prepare policies for real-world deployment. Compare CALVIN with Claru's production-grade, real-world data collection.

CALVIN Profile

Institution

University of Freiburg

Year

2022

Scale

24 hours of play data, 34 tasks, 389 language instructions across 4 simulated environments, ~20K annotated language directives

License

MIT License

Modalities

Simulated RGB-D (static + gripper cameras)
Proprioception (joint angles, velocities)
Vision-based tactile sensing
Natural language task instructions (389 unique, ~20K total directives)

How Claru Helps Teams Beyond CALVIN

CALVIN has been instrumental in advancing language-conditioned, long-horizon manipulation research. Its multi-environment evaluation protocol and sequential task chaining provide a uniquely demanding benchmark that has driven progress in combining language understanding with sustained robotic execution -- over 60 papers now cite CALVIN for evaluation.

However, CALVIN's purpose is algorithmic benchmarking in simulation, not data for production training. The play data format, while innovative for learning diverse skills, produces trajectories that mix successful and incomplete actions, with no task-level ground truth until hindsight relabeling is applied. The simulated environments, while structurally varied across the A-D split, lack the visual complexity, physics fidelity, and sensor noise of real deployment settings. Policies that achieve high CALVIN scores routinely fail when transferred to real robots -- the sim-to-real gap in rendering, contact dynamics, and depth sensing is too large for direct transfer.

Claru bridges these gaps by collecting task-directed demonstrations on your physical robot, with natural language annotations from diverse human annotators. Our demonstrations are segmented by task with validated success outcomes, providing a clean training signal for both single-step and multi-step policies. For teams that have developed their architecture on CALVIN and need to deploy on real hardware, Claru provides the real-world data -- with real depth noise, real contact physics, and real linguistic variety -- that turns benchmark performance into production reliability.

What Is CALVIN?

CALVIN (Composing Actions from Language and Vision) is a simulated benchmark for language-conditioned, long-horizon robot manipulation, developed by Oier Mees, Lukas Hermann, and Wolfram Burgard at the University of Freiburg. Published in IEEE Robotics and Automation Letters in 2022 (arXiv:2112.03227), CALVIN was designed to evaluate agents that can solve many manipulation tasks over extended horizons, specified only via natural language, from onboard sensor observations.

The benchmark includes approximately 24 hours of teleoperated unstructured play data together with 389 unique language instructions for 34 tasks across 4 manipulation environments (labeled A, B, C, D). These environments share structural similarities but differ in object positions, colors, and scene configurations, enabling evaluation of how well agents generalize across visual variations. Tasks include sliding, pushing, rotating, stacking, lifting, and using switches and drawers. The play data approach -- where operators perform free-form manipulation without specific task instructions -- was specifically designed to produce naturally diverse trajectories that cover multiple skills within single episodes, addressing the artificial uniformity of earlier task-directed collection protocols.

CALVIN's sensor suite is rich for a simulation benchmark: RGB-D images from both a static third-person camera and a gripper-mounted camera, proprioceptive information (joint angles and velocities), and vision-based tactile sensing. The simulation uses a Franka Emika Panda arm operating in a PyBullet-based physics environment. Each environment contains a tabletop scene with a sliding door, a drawer, a button that toggles an LED, a switch that controls a light bulb, and several colored blocks -- providing sufficient object diversity for multi-task evaluation while remaining reproducible across research groups.

CALVIN's most distinctive contribution is its long-horizon evaluation protocol, which chains up to 5 tasks sequentially and measures cumulative success. An agent receives a sequence of language instructions (e.g., 'push the blue block right', 'turn on the light', 'open the drawer') and must execute them in order without resetting the environment between tasks. This evaluation structure tests whether policies can maintain coherent behavior across task boundaries -- a capability that single-task benchmarks cannot assess. The standard train-on-ABC / test-on-D protocol further tests generalization to unseen environment configurations. Released under the MIT License, CALVIN has become a standard evaluation platform adopted by over 60 research papers studying long-horizon, language-conditioned manipulation.

CALVIN at a Glance

34 Manipulation Tasks
4 Simulated Environments (A-D)
24 hrs of Play Data
389 Unique Language Instructions
~20K Language Directives
MIT License

CALVIN vs. Claru: Side-by-Side Comparison

A detailed comparison across the dimensions that matter for production deployment of language-conditioned manipulation.

Dimension | CALVIN | Claru
Data Source | Simulated play data (PyBullet) | Real-world teleoperated demonstrations
Scale | 24 hours of play data across 34 tasks | 1K to 1M+ task-directed demonstrations
Robot Platform | Simulated Franka Panda only | Any physical robot you deploy
Camera Setup | Static + gripper RGB-D (simulated) | Configurable multi-view with real cameras
Language Conditioning | 389 unique instructions for 34 tasks | Free-form natural language with multi-annotator diversity
Environment Diversity | 4 visually varied simulation scenes (A-D) | Real environments matching your deployment
Sensor Modalities | Simulated RGB-D, proprioception, tactile | Real RGB + depth + force/torque + proprioception + tactile
Data Structure | Unstructured play (multi-skill episodes) | Task-directed demonstrations with success validation
License | MIT License | Commercial license with IP assignment
Long-Horizon Evaluation | Chains up to 5 tasks sequentially | Custom multi-step task sequences for your workflow
Physics Fidelity | Simplified PyBullet dynamics | Real-world contact dynamics, no sim gap
Visual Realism | Synthetic rendering with uniform lighting | Real imagery with natural lighting variation

Architecture of the CALVIN Benchmark

CALVIN's data collection protocol is based on unstructured play, a concept inspired by developmental psychology where operators interact freely with the environment without specific task goals. During collection, teleoperators explored the tabletop scene for extended periods, manipulating objects, operating mechanisms (drawers, switches), and performing sequences of actions that naturally span multiple task categories. This produces episodes where a single trajectory might contain pushing a block, opening a drawer, and toggling a switch -- all without explicit task boundaries.

The benchmark then applies hindsight language labeling: after collection, language annotators watch the recorded trajectories and retroactively assign natural language descriptions to segments where specific tasks were completed. This produces the 389 unique language instructions paired with trajectory segments, organized into 34 task categories. Each task category has approximately 11 unique language variations (e.g., 'push the red block left', 'slide the red block to the left', 'move the red block leftward'), though the linguistic variation is constrained by the template-based annotation protocol.
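The pairing step described above can be sketched in a few lines. Assume annotations arrive as (start frame, end frame, instruction) tuples over one long play stream; the function name, fixed window length, and the frame indices below are illustrative assumptions, not CALVIN's actual implementation.

```python
# Sketch of hindsight language labeling: annotators mark (start, end) frame
# spans in a long play stream and attach an instruction; training pairs are
# then (trajectory window, instruction). All values here are illustrative.

def relabel(stream_len, annotations, window=64):
    """Turn hindsight annotations into fixed-length (slice, instruction) pairs.

    annotations: list of (start_frame, end_frame, instruction) tuples.
    Returns (begin, stop, instruction) training windows clipped to the stream.
    """
    pairs = []
    for start, end, instruction in annotations:
        stop = min(end, stream_len)
        begin = max(stop - window, 0)  # keep the last `window` frames of the segment
        pairs.append((begin, stop, instruction))
    return pairs

annotations = [
    (120, 180, "push the red block left"),
    (185, 260, "open the drawer"),
]
print(relabel(10_000, annotations))
# [(116, 180, 'push the red block left'), (196, 260, 'open the drawer')]
```

Note that the annotated spans need not tile the stream: frames between tasks stay unlabeled, which is exactly the "no task-level ground truth" property discussed elsewhere in this article.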

The four environments (A, B, C, D) share the same table and robot but differ in object colors, positions, and background textures. The standard evaluation protocol trains on environments A, B, and C and tests on D, measuring whether the policy generalizes to unseen visual configurations. This cross-environment evaluation is CALVIN's most demanding test and has proven to be a significant discriminator between methods -- many approaches that achieve high success on seen environments (A, B, C) drop substantially on the unseen environment D.

The long-horizon evaluation metric -- completing chains of 1 to 5 sequentially specified tasks -- has become CALVIN's signature contribution to the field. A score of '5/5' means the agent completed all 5 tasks in sequence. This metric exposed a critical weakness in early methods: many could complete individual tasks reliably but failed to maintain coherent state tracking across task boundaries, for example leaving the gripper in a state that prevented executing the next task. Methods like HULC (Mees et al., 2022), SuSIE (Black et al., 2024), and GR-1 (Wu et al., 2023) have progressively improved on this metric, with current state-of-the-art approaches achieving average chain completion lengths above 3.5 out of 5.
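The chain metric is simple to state precisely: for each evaluation rollout, count consecutive successes before the first failure, then average across rollouts. The sketch below is an illustrative reimplementation, not the official CALVIN evaluation code, and the rollout results are made up.

```python
# Illustrative reimplementation of CALVIN's long-horizon chain metric:
# each rollout attempts 5 tasks in sequence, and the score is how many
# were completed before the first failure.

def chain_length(task_results):
    """Number of consecutive successes before the first failure."""
    n = 0
    for ok in task_results:
        if not ok:
            break
        n += 1
    return n

def summarize(rollouts):
    """rollouts: list of per-task success lists (length 5 each)."""
    lengths = [chain_length(r) for r in rollouts]
    avg_len = sum(lengths) / len(lengths)
    # Success rate at chain depth k: fraction of rollouts with >= k successes.
    rates = {k: sum(l >= k for l in lengths) / len(lengths) for k in range(1, 6)}
    return avg_len, rates

rollouts = [
    [True, True, True, True, True],   # 5/5
    [True, True, False, True, True],  # counts as 2: later successes are ignored
    [True, True, True, False, False], # 3/5
    [False, True, True, True, True],  # 0/5: the chain ends at the first failure
]
avg_len, rates = summarize(rollouts)
print(avg_len)   # 2.5
print(rates[5])  # 0.25
```

The second and fourth rollouts show why this metric is demanding: a single failure truncates the chain, so per-task success rates overstate what the metric rewards.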

Key Limitations of CALVIN for Production Use

CALVIN's fundamental limitation is the sim-to-real gap. All data exists in a PyBullet-based simulation with simplified rendering, idealized physics, and synthetic textures. Policies that score well on CALVIN's long-horizon evaluation chains (completing up to 5 tasks sequentially) routinely fail on physical hardware where lighting varies, objects have real material properties, and contact dynamics are stochastic. The visual gap is particularly severe: CALVIN's rendering produces uniform lighting, clean textures, and perfect edges that bear little resemblance to real camera footage with noise, reflections, and variable illumination.

The unstructured play data format, while valuable for learning diverse skills, creates challenges for production training. Play data mixes successful and incomplete actions within single trajectories, with no task-level segmentation or success labels until hindsight relabeling is applied. The relabeling process itself is imperfect: annotators may disagree on where a task starts and ends, and partially completed actions can be misclassified as successes or failures. Extracting clean, task-directed demonstrations from play data therefore remains an active research area, not a solved engineering problem.

CALVIN's 34 tasks cover a narrow manipulation vocabulary focused on tabletop interactions: pushing, sliding, rotating, and basic pick-and-place. Production deployments typically require dozens to hundreds of task types including assembly, tool use, deformable object handling, and contact-rich manipulation that CALVIN does not address. The task set is also heavily biased toward rigid object manipulation in uncluttered scenes -- the tabletop contains only a few objects at a time, with ample free space between them.

The 389 language instructions, while diverse relative to earlier benchmarks, represent a limited linguistic distribution. Real users issue instructions with far greater variability, ambiguity, and context-dependence. Policies trained on CALVIN's instruction set overfit to its particular phrasings and struggle with natural language in deployment. The template-based annotation protocol means that instruction variation is largely syntactic rather than truly diverse -- multiple annotators might produce different word orders, but the underlying semantic templates are shared.

CALVIN's depth data is perfect synthetic depth with no noise, missing data, or sensor artifacts. Real depth sensors (Intel RealSense, Azure Kinect) fail on reflective surfaces, transparent objects, and at material boundaries -- challenges that simulation does not prepare policies to handle. Similarly, CALVIN's proprioception is noise-free joint state, while real robots exhibit encoder quantization, vibration artifacts, and calibration drift.
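One common mitigation for the perfect-depth problem -- not part of CALVIN itself -- is to corrupt clean simulated depth with sensor-like artifacts during training so policies do not overfit to ideal input. A minimal sketch, with illustrative (uncalibrated) noise parameters:

```python
import numpy as np

def corrupt_depth(depth, noise_std=0.005, dropout_p=0.02, rng=None):
    """Add sensor-like artifacts to clean simulated depth.

    depth: H x W array of metric depth. Returns a noisy copy with random
    holes set to 0, mimicking the invalid-pixel convention of real RGB-D
    cameras. noise_std and dropout_p are illustrative, not calibrated to
    any particular sensor.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = depth + rng.normal(0.0, noise_std, size=depth.shape)
    holes = rng.random(depth.shape) < dropout_p  # random missing returns
    noisy[holes] = 0.0
    return np.clip(noisy, 0.0, None)  # depth cannot be negative

clean = np.full((120, 160), 0.8)  # a flat surface at 0.8 m
noisy = corrupt_depth(clean)
```

A more faithful model would also drop pixels at depth discontinuities and on low-reflectivity regions, but even this simple corruption forces a policy to tolerate holes in its depth input.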

When to Use CALVIN vs. Commercial Data

CALVIN is the right choice when evaluating long-horizon, language-conditioned manipulation architectures. Its structured train-on-ABC / test-on-D protocol provides a clean generalization benchmark, and its multi-task sequential evaluation measures whether models maintain coherent behavior across task chains -- a capability that most benchmarks cannot evaluate. If you are publishing a paper on long-horizon planning, CALVIN provides the standardized evaluation that reviewers expect and that enables fair comparison with prior work.

CALVIN is also useful for studying how agents learn from unstructured play data without explicit task labels. If your research explores hindsight relabeling, skill discovery from play, or self-supervised task decomposition, CALVIN's play data format is specifically designed for these investigations. The availability of both annotated segments and raw play data enables controlled experiments on the value of language supervision.

Switch to Claru when you need language-conditioned manipulation policies that work on physical robots. Real-world language-conditioned systems must handle real visual complexity, real contact dynamics, and the full breadth of natural language variation. Claru collects task-directed demonstrations on your robot with natural language annotations from multiple annotators, producing the grounded language-action correspondence that simulation cannot provide. The transition from CALVIN to real-world deployment is not a fine-tuning step -- it requires fundamentally different training data that captures the physics, visual complexity, and linguistic diversity of real environments.

How Claru Complements CALVIN

Claru provides the real-world grounding that transforms CALVIN-developed architectures into deployable products. Where CALVIN offers simulated play data with template-based language instructions, Claru collects task-directed demonstrations on your physical robot with diverse natural language annotations generated by multiple human annotators describing the same behavior in their own words. This annotation diversity produces the robust language grounding that deployment requires.

For teams that have validated their long-horizon planning approach on CALVIN, Claru delivers the fine-tuning data that closes the sim-to-real gap. Our demonstrations capture real depth sensor noise, real contact dynamics, and real environmental variability -- the factors that determine whether a long-horizon policy can maintain coherent execution across multiple real-world task steps. The CALVIN-to-deployment transition is not just about visual domain adaptation; it requires learning how real objects behave when grasped, pushed, and stacked in ways that simulation physics cannot accurately model.

Claru also scales beyond CALVIN's 34-task vocabulary. We collect demonstrations for whatever manipulation tasks your deployment requires, including contact-rich operations, deformable objects, and tool use that CALVIN's simulation cannot support. Every demonstration is validated for task success and annotated with natural language at the diversity level real-world language conditioning demands.

Data is delivered in RLDS, HDF5, zarr, or LeRobot format with standardized schemas compatible with the same training pipelines used for CALVIN-based research. For teams using HULC, SuSIE, or other CALVIN-native architectures, we provide data in the exact episode format these methods expect, with task segmentation, language pairing, and observation structure that enables direct integration without preprocessing pipeline changes.

References

  1. Mees, O., Hermann, L., Rosete-Beas, E., and Burgard, W. "CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks." IEEE RA-L 2022 / arXiv:2112.03227.
  2. Mees, O., Hermann, L., and Burgard, W. "What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data" (HULC). IEEE RA-L 2022 / arXiv:2204.06252.
  3. Black, K., Nakamoto, M., Atreya, P., et al. "Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models." ICLR 2024 / arXiv:2310.10639.
  4. Chi, C., Feng, S., Du, Y., et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS 2023 / arXiv:2303.04137.
  5. Lynch, C., Khansari, M., Xiao, T., et al. "Learning Latent Plans from Play." CoRL 2019 / arXiv:1903.01973.
  6. Wu, H., Jing, Y., Cheang, C., et al. "GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation." arXiv:2312.13139, 2023.

Frequently Asked Questions

Is CALVIN suitable for training production robot policies?

CALVIN is designed as a research benchmark for evaluating long-horizon, language-conditioned manipulation algorithms, not as a training data source for production robots. Its simulated play data captures useful manipulation diversity but lacks the real-world visual complexity, contact dynamics, and linguistic variety needed for production deployment. Policies that achieve high scores on CALVIN's chain-of-5 evaluation (current SOTA exceeds 3.5 average chain length) routinely fail when transferred to physical hardware due to the sim-to-real gap in rendering, physics, and sensor noise. Use CALVIN for architecture development and benchmarking, then transition to Claru's real-world data for deployment training.

Can CALVIN be used commercially?

Yes, CALVIN is released under the MIT License, which permits commercial use. However, the practical limitation is significant: simulation-only play data requires substantial real-world fine-tuning to achieve production-level performance on physical robots. The MIT license covers the benchmark code, data, and task definitions, so you can freely incorporate CALVIN into your research and development pipeline. The real cost is not licensing but the engineering effort to bridge the sim-to-real gap -- effort that is often more expensive than collecting real-world data directly.

What is 'play data' and how does it differ from task-directed demonstrations?

CALVIN uses 'play data' -- unstructured manipulation where operators freely interact with objects without specific task instructions, inspired by the 'Learning from Play' paradigm (Lynch et al., 2019). This produces diverse, multi-skill trajectories: a single 5-minute play episode might contain pushing blocks, opening a drawer, toggling switches, and stacking objects. The advantage is natural skill diversity without the artificial uniformity of task-directed collection. The disadvantage is that play data lacks task boundaries, success labels, and explicit language pairing until hindsight relabeling is applied. For production training, task-directed demonstrations with validated outcomes are more sample-efficient because every trajectory provides a clean training signal for a specific skill. Claru provides task-directed demonstrations with success validation.

How does CALVIN's long-horizon evaluation work?

CALVIN chains up to 5 sequential language-conditioned tasks and measures how many the agent completes in order, without environment resets between tasks. For example, the agent might receive: (1) 'push the blue block right', (2) 'turn on the light', (3) 'open the drawer', (4) 'lift the red block', (5) 'rotate the pink block left'. The agent must execute each in sequence, maintaining world-state coherence across tasks. This evaluation exposed a critical weakness in early methods: many could complete individual tasks reliably but failed across task boundaries due to gripper state mismanagement or accumulated positional errors. Current SOTA methods (GR-1, SuSIE) achieve average completion lengths above 3.5 out of 5.

Can Claru collect long-horizon, multi-step demonstrations?

Yes. Claru collects demonstrations for multi-step task sequences matching your production workflows. Our teleoperators execute full operational sequences (e.g., retrieve item from shelf, inspect it, pack it, label the box) as continuous demonstrations, providing the long-horizon training signal that deployment requires. Unlike CALVIN's post-hoc language labeling, Claru annotates each sub-task with language during collection, producing cleaner task-boundary annotations. We also collect both task-directed demonstrations (explicit instructions before each task) and free-play demonstrations (operator-driven exploration) depending on your training pipeline's needs.

How widely is CALVIN used in research?

As of early 2026, over 60 published research papers cite CALVIN as an evaluation benchmark, making it one of the most widely adopted platforms for long-horizon, language-conditioned manipulation. Notable methods evaluated on CALVIN include HULC (Mees et al., 2022), which established the first strong baseline for the long-horizon chain evaluation; SuSIE (Black et al., 2024), which demonstrated that pretrained image-editing diffusion models could be repurposed for zero-shot manipulation; and GR-1 (Wu et al., 2023), which used large-scale video generative pretraining. The benchmark's popularity stems from its unique long-horizon evaluation protocol and its reproducible simulation environment, which eliminates confounding hardware variation across research groups.

Bridge the Sim-to-Real Gap for Language-Conditioned Manipulation

Get real-world demonstrations with natural language annotations on your robot. Talk to our team about production data for long-horizon manipulation.