Language-Table Alternative: Production-Grade Language-Conditioned Robot Data
Google's Language-Table dataset provides 442,226 language-annotated demonstrations of tabletop block manipulation -- the largest language-conditioned robot dataset with real-time interactive language grounding. But its scope is limited to 2D planar pushing of colored blocks, the sensor modality is a single top-down RGB camera, and the task complexity falls far short of production manipulation requirements. Compare Language-Table with Claru's multi-task, multi-modal language-conditioned data.
Language-Table Profile
- Organization: Google (Brain Robotics / DeepMind)
- Released: 2023
- Scale: 442,226 language-annotated demonstrations with 87,000 unique language strings
- License: Apache 2.0
How Claru Helps Teams Beyond Language-Table
Language-Table made a landmark contribution to language-conditioned robot learning: it proved that a simple behavioral cloning policy trained on 442,000+ language-annotated demonstrations could follow 87,000 unique natural language instructions with 93.5% success. As part of OXE, it provides valuable language-grounding priors for generalist policies. However, the gap between pushing blocks on a table and performing production manipulation is enormous. Language-Table's 2D actions cannot represent grasping, lifting, or 3D manipulation. Its single top-down camera cannot capture the multi-view, multi-modal information that production policies require. And its limited object set does not prepare a model for the diversity of real-world objects. Claru provides the data that bridges this gap: language-conditioned demonstrations of full 6-DoF manipulation tasks, collected on your robot platform with the complete sensor stack (RGB, depth, force/torque, proprioception) and validated language annotations that ensure reliable language-action mapping. Teams that pretrain on OXE (including Language-Table) for language understanding and fine-tune on Claru data for task-specific manipulation achieve robust, language-instructable production policies.
What Is Language-Table?
Language-Table is a large-scale dataset and benchmark for language-conditioned robotic manipulation, published by Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence at Google (arXiv 2210.06407, IEEE Robotics and Automation Letters 2023). The project introduced a framework for training interactive, real-time, natural language-instructable robots and demonstrated that behavioral cloning on a sufficiently large dataset of language-annotated trajectories could produce policies that follow open-vocabulary natural language instructions in real time.
The dataset contains 442,226 human-collected demonstrations of tabletop manipulation tasks on a flat surface with colored blocks and a small number of other objects. A human teleoperator controls a robot arm (xArm with a spatula-like end-effector) via a 2D joystick to push objects around the table surface, while simultaneously providing natural language descriptions of the intended action. The observation space is a single top-down RGB camera, and the action space is 2D planar end-effector velocity commands. The resulting policy achieved a 93.5% success rate on a test set of 87,000 unique natural language strings -- an order of magnitude more commands than any previous language-conditioned robot system could handle.
The key innovation of Language-Table was demonstrating interactive language grounding: the human operator could give real-time language instructions (e.g., 'push the blue block to the left', 'make a smiley face out of blocks', 'move banana to the sum of two plus one') and the robot would respond in real time, continuously updating its behavior based on the latest language input. This represented a significant advance over prior work where language instructions were given once before execution rather than interactively during it. The dataset captures this interactive dynamic, with language annotations temporally aligned to the corresponding manipulation actions.
Language-Table is released under an Apache 2.0 license and is included in the Open X-Embodiment (OXE) collection in the RLDS format on TensorFlow Datasets. The dataset, environment (a simulated version of the tabletop setup), benchmark suite, and trained policies are all open-sourced on GitHub. While Language-Table established the viability of large-scale language-conditioned imitation learning, its specific task domain -- 2D planar block pushing with a top-down camera -- represents a narrow slice of the manipulation capabilities that production robots require.
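To make the RLDS episode layout concrete, here is a minimal sketch of iterating a Language-Table-style episode. It uses a mock in-memory episode rather than the real TensorFlow Datasets download, and the field names (`observation`, `rgb`, `instruction`, `action`), the image resolution, and the 512-byte zero-padded instruction encoding are assumptions that should be checked against the dataset's published feature spec before relying on them.

```python
import numpy as np

def make_mock_step(instruction: str):
    """Build one mock RLDS-style step (assumed field names and shapes)."""
    # Assumption: the instruction is stored as a fixed-length, zero-padded
    # array of UTF-8 byte values rather than a plain string.
    inst = np.zeros(512, dtype=np.int32)
    raw = instruction.encode("utf-8")
    inst[: len(raw)] = np.frombuffer(raw, dtype=np.uint8)
    return {
        "observation": {
            "rgb": np.zeros((360, 640, 3), dtype=np.uint8),  # single top-down camera
            "instruction": inst,
        },
        # 2D planar end-effector velocity: (dx, dy) only -- no height,
        # no orientation, no gripper channel.
        "action": np.array([0.01, -0.02], dtype=np.float32),
    }

def decode_instruction(inst: np.ndarray) -> str:
    """Strip zero padding and decode the byte array back to text."""
    return bytes(inst[inst != 0].astype(np.uint8)).decode("utf-8")

episode = [make_mock_step("push the blue block to the left") for _ in range(3)]
for step in episode:
    assert decode_instruction(step["observation"]["instruction"]) \
        == "push the blue block to the left"
    assert step["action"].shape == (2,)  # 2-DoF planar action
```

The two-element action vector is the structural fact that matters for production use: whatever policy head is trained on this data learns to emit planar velocities and nothing else.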
Language-Table vs. Claru: Side-by-Side Comparison
A detailed comparison across the dimensions that matter when building production language-conditioned robot systems.
| Dimension | Language-Table | Claru |
|---|---|---|
| License | Apache 2.0 -- commercially permissive | Commercial license with IP assignment |
| Scale | 442,226 demos with 87,000 unique language strings | 1K to 1M+ demonstrations with validated language annotations |
| Task Complexity | 2D planar pushing of blocks on a flat surface | Full 6-DoF manipulation: grasping, placing, assembly, tool use, pouring |
| Action Space | 2D planar end-effector velocity (x, y only) | Full 6-DoF or 7-DoF actions matching your robot's control interface |
| Sensor Modalities | Single top-down RGB camera only | RGB + depth + force/torque + proprioception + multi-view cameras |
| Object Diversity | Colored blocks, toy banana, and a few simple objects | Your actual production objects of any size, material, and geometry |
| Manipulation Type | Non-prehensile pushing only -- no grasping, lifting, or placing | Prehensile and non-prehensile: grasping, lifting, placing, inserting, pouring |
| Robot Platform | xArm with custom spatula end-effector | Your specific robot platform and end-effector |
| Language Quality | Real-time human narration during teleoperation (naturally noisy) | Validated language annotations with multi-annotator agreement checks |
| Environment | Single white tabletop in a lab | Your actual deployment environment with real-world clutter and variation |
Key Limitations of Language-Table for Production Use
Language-Table's most significant limitation is the extreme simplicity of its manipulation task. All 442,226 demonstrations involve pushing objects across a flat surface in 2D -- there is no grasping, no lifting, no placing, no insertion, no assembly, and no 3D manipulation of any kind. The action space is 2D planar end-effector velocity (x and y), meaning the robot arm only moves in the horizontal plane without controlling height, orientation, or gripper state. Production manipulation systems must operate in full 3D with grasping, precise placement, force-controlled assembly, and complex multi-step task sequences that Language-Table's task domain does not begin to address.
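The action-space gap can be shown directly. The sketch below contrasts a Language-Table-style planar action with a common 7-DoF Cartesian action; the `CartesianAction` layout is illustrative only, since real control interfaces vary by robot.

```python
from dataclasses import dataclass

@dataclass
class PlanarAction:
    """Language-Table-style action: 2D end-effector velocity on the table plane."""
    dx: float
    dy: float

@dataclass
class CartesianAction:
    """Illustrative 7-DoF production action: translation, rotation, gripper."""
    dx: float; dy: float; dz: float           # translation deltas
    droll: float; dpitch: float; dyaw: float  # orientation deltas
    gripper: float                            # 0.0 = open .. 1.0 = closed

def lift_and_close(a: CartesianAction) -> bool:
    """A grasp step needs upward z motion and a gripper close command --
    neither channel exists in the planar action."""
    return a.dz > 0 and a.gripper > 0.5

push = PlanarAction(dx=0.01, dy=-0.02)
grasp = CartesianAction(0, 0, 0.05, 0, 0, 0, gripper=1.0)
assert lift_and_close(grasp)
assert not hasattr(push, "dz") and not hasattr(push, "gripper")
```

No amount of fine-tuning of the action head recovers channels that the training data never exercised; grasping behavior has to come from demonstrations that contain it.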
The sensor modality is a single top-down RGB camera capturing the tabletop from directly above. This fixed overhead viewpoint works well for planar block pushing but is unrepresentative of how production robots perceive their environment. Real robot systems use eye-in-hand cameras, wrist-mounted cameras, multi-view setups, depth sensors, force/torque sensors, proprioceptive streams, and tactile sensors. A policy trained exclusively on Language-Table's single top-down RGB view will not generalize to the camera configurations and multi-modal sensor stacks used in production deployments.
The language annotations, while impressively diverse (87,000 unique strings), were collected as real-time narrations during teleoperation. This means they carry the noise of spontaneous speech: grammatical inconsistencies, ambiguous references, and varying levels of specificity. For production systems where language instructions must be reliably parsed and executed, the annotation quality and consistency standards of Language-Table may not meet requirements. Production deployments typically need language annotations that have been validated against the demonstrated behavior by multiple annotators to ensure that the instruction unambiguously describes the task.
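One simple form of the multi-annotator validation described above is a majority-agreement check over normalized annotations. The sketch below is a hypothetical rule (exact match after lowercasing and whitespace collapsing, with a 2/3 agreement threshold); production pipelines typically add richer semantic equivalence checks.

```python
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial phrasing noise doesn't count."""
    return " ".join(s.lower().split())

def majority_label(annotations: list[str], min_agreement: float = 0.66):
    """Return the majority annotation if enough annotators agree, else None.

    Hypothetical validation rule: an episode whose annotators disagree
    is flagged for re-annotation rather than shipped.
    """
    counts = Counter(normalize(a) for a in annotations)
    label, n = counts.most_common(1)[0]
    return label if n / len(annotations) >= min_agreement else None

# Two of three annotators agree (after normalization): the episode passes.
assert majority_label([
    "Push the blue block left",
    "push the blue block  left",
    "nudge the block",
]) == "push the blue block left"

# Three-way disagreement: the episode is rejected for re-annotation.
assert majority_label(["push left", "pull right", "lift up"]) is None
```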
The object set is extremely limited. Language-Table demonstrations primarily involve colored blocks (red, blue, green, yellow) and a small number of toy objects (a toy banana, small figurines). These objects are all small, rigid, and lightweight -- selected specifically for the 2D pushing task. Production robots manipulate objects of widely varying size (millimeter-scale components to meter-scale packages), material (rigid, soft, deformable, fragile, wet), weight (grams to kilograms), and geometry (regular, irregular, articulated). The distribution of objects in Language-Table does not prepare a policy for this diversity.
When to Use Language-Table vs. Commercial Data
Language-Table is the gold standard for research on language-conditioned manipulation and interactive language grounding. If your work focuses on how robots can follow open-vocabulary language instructions, how to design architectures for language-to-action mapping, or how to enable real-time interactive language control, Language-Table provides the largest, most carefully designed benchmark available. Its 87,000 unique language strings enable rigorous evaluation of language generalization, and the open-sourced simulation environment allows controlled experimentation.
As part of the OXE collection, Language-Table contributes valuable language-conditioned data to cross-embodiment pretraining. Generalist policies like Octo and OpenVLA benefit from Language-Table's language grounding -- even though the task is simple, the diversity of language instructions teaches models to map varied natural language to manipulation behavior. The Apache 2.0 license makes this pretraining use commercially permissible.
Switch to commercial data when your production system needs language-conditioned policies for real 3D manipulation tasks. The gap between pushing blocks on a flat surface and executing production manipulation workflows (pick-and-place, assembly, tool use, food preparation, warehouse fulfillment) is enormous. Production language-conditioned systems need demonstrations of complex tasks with validated language annotations, full sensor coverage, and the 6-DoF (or higher) action spaces that real manipulation demands.
The recommended pipeline is: pretrain on OXE (including Language-Table) for language-grounding priors and broad manipulation capability, then fine-tune on Claru's language-annotated, multi-modal demonstrations for your specific tasks and deployment environment. This captures Language-Table's strength in language understanding while grounding manipulation behavior in the task complexity and sensor coverage of real-world deployment.
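The two-stage recipe can be sketched with a deliberately tiny stand-in: a linear policy trained by behavioral cloning on mock data, pretrained on a broad mixture and then fine-tuned on a small task-specific set. Real pipelines use vision-language-action models over RLDS shards; the dimensions, learning rates, and random data here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def bc_train(W, obs, acts, lr=0.1, steps=200):
    """Behavioral cloning for a linear policy: MSE regression by gradient descent."""
    for _ in range(steps):
        pred = obs @ W
        grad = obs.T @ (pred - acts) / len(obs)
        W = W - lr * grad
    return W

obs_dim, act_dim = 8, 7  # mock feature size / 7-DoF action size (illustrative)

# Stage 1: pretrain on a broad, OXE-style mixture (mocked as random data).
pre_obs = rng.normal(size=(256, obs_dim))
pre_act = rng.normal(size=(256, act_dim))
W = bc_train(np.zeros((obs_dim, act_dim)), pre_obs, pre_act)

# Stage 2: fine-tune on a smaller task-specific set, starting from W.
ft_obs = rng.normal(size=(64, obs_dim))
ft_act = rng.normal(size=(64, act_dim))
W_ft = bc_train(W, ft_obs, ft_act, lr=0.05, steps=100)

def mse(weights):
    """Mean squared action error on the fine-tuning set."""
    return float(np.mean((ft_obs @ weights - ft_act) ** 2))

# Fine-tuning fits the target tasks better than the pretrained weights alone.
assert mse(W_ft) < mse(W)
```

The point the toy model illustrates holds at scale: pretraining supplies a useful initialization, but closing the gap to the target task distribution requires in-domain demonstrations.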
How Claru Complements Language-Table
Claru provides the language-conditioned manipulation data that bridges the gap between Language-Table's 2D block pushing and the 3D manipulation tasks that production robots perform. Where Language-Table captures 442,226 demonstrations of planar pushing with top-down RGB, Claru provides expert teleoperation demonstrations of full 6-DoF manipulation -- grasping, placing, assembly, insertion, tool use, and multi-step task sequences -- each paired with validated natural language task descriptions.
Every Claru demonstration includes the multi-modal sensor coverage that production policies require: multi-view RGB and depth imagery, 6-axis force/torque measurements, complete proprioceptive state, and gripper contact detection. Language annotations are not real-time narrations but carefully crafted task descriptions validated by multiple annotators for accuracy, specificity, and consistency. This ensures that the language-action mapping your policy learns is grounded in precise, unambiguous instructions.
Claru's task scope matches your production requirements, not the constraints of a tabletop pushing setup. Whether your language-conditioned robot needs to 'pick up the red box from the left shelf and place it in the shipping bin', 'insert the M4 screw into the threaded hole with 2 Nm torque', or 'slice the tomato into 5mm rounds' -- our demonstrations capture these tasks with the action dimensions, force profiles, and multi-step structure they require. The language vocabulary, task granularity, and instruction complexity are defined collaboratively to match your deployment's natural language interface.
All data ships with a commercial license and IP assignment, delivered in RLDS (for OXE co-training), HDF5, zarr, and LeRobot formats. Language annotations include both task-level descriptions and step-level sub-instructions where applicable, enabling hierarchical language-conditioned policy training at multiple temporal granularities.
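Hierarchical annotations of the kind described above can be represented as a task-level record with frame-aligned sub-instructions. The schema below is hypothetical (field names and frame alignment are assumptions, not Claru's actual delivery format) but shows how a policy could query the active instruction at any timestep.

```python
from dataclasses import dataclass, field

@dataclass
class StepAnnotation:
    """Step-level sub-instruction aligned to a frame range (hypothetical schema)."""
    text: str
    start_frame: int
    end_frame: int  # exclusive

@dataclass
class EpisodeAnnotation:
    """Task-level description plus ordered step-level sub-instructions."""
    task: str
    steps: list[StepAnnotation] = field(default_factory=list)

    def instruction_at(self, frame: int) -> str:
        """Return the active sub-instruction for a frame, falling back to the task."""
        for s in self.steps:
            if s.start_frame <= frame < s.end_frame:
                return s.text
        return self.task

ep = EpisodeAnnotation(
    task="pick up the red component and insert it into slot B",
    steps=[
        StepAnnotation("approach from the left", 0, 40),
        StepAnnotation("grasp with 3N force", 40, 60),
        StepAnnotation("lift 10cm", 60, 90),
    ],
)
assert ep.instruction_at(45) == "grasp with 3N force"
assert ep.instruction_at(120) == ep.task  # past the last step: task-level text
```

A high-level planner can condition on `task` while a low-level controller conditions on `instruction_at(frame)`, which is what "multiple temporal granularities" buys in practice.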
References
- [1] Lynch et al. “Interactive Language: Talking to Robots in Real Time.” IEEE RA-L, 2023.
- [2] Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA, 2024.
- [3] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL, 2023.
- [4] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” CoRL, 2024.
- [5] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” RSS, 2024.
Frequently Asked Questions
Can Language-Table be used commercially?
Yes. Language-Table is released under an Apache 2.0 license, which permits commercial use. However, the dataset covers only 2D planar block pushing with a single top-down RGB camera and 2D action space. While commercially licensable, Language-Table alone cannot train policies for production manipulation that requires 3D grasping, precise placement, force control, and multi-modal sensing. It is most useful as a pretraining component (within OXE) for language-grounding priors that are then fine-tuned on domain-specific data from Claru.
Does Language-Table's 93.5% success rate transfer to real-world manipulation?
Language-Table's 93.5% success rate was measured on its specific task domain: pushing colored blocks on a flat white table in response to language commands, using a top-down camera and 2D planar actions. This is a controlled, simplified manipulation scenario. Success rates on real-world 3D manipulation tasks (grasping, placement, assembly) using policies pretrained on Language-Table and transferred to different robots and environments will be significantly lower without domain-specific fine-tuning. The language understanding capability transfers well, but the manipulation capability must be retrained for tasks involving 3D motion, grasping, force control, and complex object interactions.
Why is Language-Table limited to 2D pushing?
The 2D pushing constraint was a deliberate design choice to maximize data collection scale. By restricting manipulation to horizontal planar movements with a spatula-like end-effector, the task requires only 2D joystick input, enabling human teleoperators to collect demonstrations rapidly. This design also simplified the action space (2D velocity) and observation space (single top-down camera), making it feasible to collect 442,000+ demonstrations. The tradeoff is that the resulting data does not cover 3D manipulation skills -- grasping, lifting, precise placement, rotation, insertion, force-controlled contact -- that production robots require.
What makes Language-Table's language annotations valuable?
Language-Table's primary contribution is the scale and diversity of its language annotations. With 87,000 unique natural language strings across 442,000 demonstrations, it provides an order of magnitude more language diversity than any other robot dataset. The instructions range from simple commands ('push the blue block left') to compositional reasoning ('move banana to the sum of two plus one') to spatial arrangements ('make a smiley face out of blocks'). This language diversity teaches models to generalize across varied instruction phrasings and compositional language structures. However, the language is tightly coupled to 2D block pushing -- production systems need language-conditioned data for more complex manipulation tasks.
Can Claru provide language-annotated demonstrations for complex manipulation?
Yes, and this is one of Claru's key strengths. Every Claru demonstration can include validated natural language task descriptions paired with multi-modal sensor data and full 6-DoF (or higher) action labels. Unlike Language-Table's real-time narrations, Claru's language annotations undergo multi-annotator validation to ensure accuracy and consistency. We support both task-level descriptions ('pick up the red component and insert it into slot B') and step-level sub-instructions ('approach from the left', 'grasp with 3N force', 'lift 10cm', 'rotate 90 degrees', 'insert with 2Nm torque') for hierarchical language-conditioned policies. Language annotations are defined collaboratively to match your natural language interface.
From Block Pushing to Production Manipulation
Complement Language-Table's language-grounding priors with multi-modal, multi-task demonstrations on your robot. Get a custom data collection plan from our team.