Language-Conditioned Robot Data: Paired Demonstrations with Natural Language Instructions
Language-conditioned robot policies promise natural human-robot interaction: tell a robot what to do in plain language and it executes. But training these models requires paired data — demonstrations annotated with the natural language instructions they execute — at a diversity and scale that current datasets cannot provide. The language grounding gap is the bottleneck.
Why Is Language-Paired Robot Data So Hard to Collect?
Language-conditioned robot learning requires demonstrations paired with natural language instructions describing the task being performed. This pairing is expensive: either existing demonstrations must be annotated with language after the fact, or demonstrations must be collected in response to specific language commands. RT-2 demonstrated that vision-language-action models can transfer web-scale language understanding to robotic control, roughly tripling generalization to novel objects and instructions [1], but this transfer depends on fine-tuning data where language instructions are precisely aligned with observed robot behaviors. OpenVLA showed that a 7B-parameter model trained on high-quality language-paired demonstrations outperformed the 55B-parameter RT-2-X by 16.5% in absolute task success rate [2], evidence that data quality can matter as much as model scale. The challenge is that collecting high-quality language-paired data requires coordinating human annotators who can write diverse, accurate, and unambiguous natural language descriptions for each demonstration.
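Concretely, "pairing" means that each demonstration carries the instruction it executes alongside its observation and action streams. A minimal sketch of such a record follows; the field names are invented for illustration and do not correspond to any specific dataset's schema:

```python
from dataclasses import dataclass

@dataclass
class LanguagePairedDemo:
    """One robot demonstration paired with its natural language instruction.

    Hypothetical schema for illustration; real datasets (e.g. those in
    Open X-Embodiment) define their own formats.
    """
    instruction: str    # natural language command, e.g. "pick up the red block"
    observations: list  # per-step sensor data (camera frames, proprioception)
    actions: list       # per-step robot actions (e.g. end-effector deltas)
    annotator_id: str = ""  # who wrote the instruction, for QA auditing
    success: bool = True    # whether the demonstration completed the task

demo = LanguagePairedDemo(
    instruction="pick up the red block",
    observations=[{"rgb": "frame_0"}, {"rgb": "frame_1"}],
    actions=[[0.0, 0.0, 0.01], [0.0, 0.0, -0.01]],
)
# Every timestep needs both an observation and an action.
assert len(demo.observations) == len(demo.actions)
```

The `annotator_id` and `success` fields hint at why collection is costly: quality review and failure labeling require human effort per demonstration, not just per dataset.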
What Are the Quality Problems in Current Language-Robot Datasets?
Language annotations in existing datasets suffer from three systematic quality issues. First, vocabulary poverty: annotations draw on a narrow set of template-like phrases ('pick up the red block', 'put the cup on the plate') that do not reflect how humans naturally give instructions; CALVIN provides language-conditioned manipulation benchmarks, but its templated language lacks the variety of real human speech [4]. Second, ambiguity avoidance: annotations omit the underspecified phrasing common in real instructions ('grab that thing over there'), so policies never learn to resolve such references through visual grounding. Third, granularity mismatch: some annotations describe goals ('make a sandwich') while others describe atomic actions ('move the knife 3 cm to the right'), and the mixed granularity confuses policy learning. SayCan showed that grounding language in affordance functions requires diverse language paired with demonstrations covering both successful and failed attempts at instruction following [3].
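The first problem, vocabulary poverty, can be surfaced with simple corpus statistics. A sketch of two such checks, using illustrative helper names and toy data (not from any real dataset):

```python
from collections import Counter

def vocabulary_diversity(instructions):
    """Type-token ratio across a set of annotations: values near 0
    indicate repetitive, template-like language."""
    tokens = [t for s in instructions for t in s.lower().split()]
    return len(set(tokens)) / len(tokens)

def most_common_templates(instructions, k=3):
    """Surface the most-reused annotation strings (template reuse)."""
    return Counter(instructions).most_common(k)

# Toy examples: templated annotations vs. natural human phrasing.
templated = ["pick up the red block", "pick up the blue block",
             "pick up the red block", "pick up the green block"]
natural = ["grab that red one", "could you lift the blue cube",
           "hand me the crimson block", "snag the green piece for me"]

# Natural speech spreads probability over far more word types.
assert vocabulary_diversity(natural) > vocabulary_diversity(templated)
```

Granularity mismatch is harder to detect automatically; in practice it calls for annotation guidelines that fix the level of description up front rather than post-hoc filtering.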
How Does Language Diversity Affect Policy Generalization?
A language-conditioned policy must understand that 'grab the mug', 'pick up the cup', 'get me that coffee thing', and 'take the ceramic vessel' can all refer to the same action. This requires training data with diverse paraphrases of the same instruction across different demonstrations. Octo was trained on 800,000 trajectories drawn from 25 datasets, but many of those datasets lacked language annotations entirely, limiting the model's language-conditioning capability [5]. Open X-Embodiment includes language annotations for only a subset of its trajectories, and those annotations were added by different teams with inconsistent conventions, producing the vocabulary and granularity mismatches that degrade cross-dataset training [6]. For production deployment, where users give natural, unscripted instructions, language diversity in training data directly determines the range of commands a robot can understand.
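One common way to exploit paraphrase diversity is to resample the instruction each time a trajectory is drawn during training, so the policy learns that many phrasings map to one behavior. A minimal sketch, assuming a hypothetical task-to-paraphrases mapping:

```python
import random

# Hypothetical paraphrase sets: several natural instructions per task.
PARAPHRASES = {
    "pick_mug": ["grab the mug", "pick up the cup",
                 "get me that coffee thing", "take the ceramic vessel"],
    "open_drawer": ["open the drawer", "pull the drawer out",
                    "slide that drawer open"],
}

def sample_instruction(task_id, rng=random):
    """Draw a random paraphrase for a task; calling this per training
    step conditions each trajectory on varied language."""
    return rng.choice(PARAPHRASES[task_id])

rng = random.Random(0)
batch = [sample_instruction("pick_mug", rng) for _ in range(8)]
# Several distinct paraphrases appear across the batch.
assert len(set(batch)) > 1
```

This only works if the dataset actually contains (or maps to) multiple paraphrases per task, which is exactly the diversity the surrounding text argues current datasets lack.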
How Do Open Datasets Compare for Language-Conditioned Robot Training?
The comparison below covers datasets with language annotations relevant to robot policy training, set against Claru's custom collection. The critical differentiators are language diversity, annotation quality, and pairing precision.

- CALVIN
- Open X-Embodiment (language subset)
- BridgeData V2
- DROID
- Claru Custom
Frequently Asked Questions
What is language-conditioned robot training data?
Language-conditioned robot training data consists of robot demonstrations (video, action trajectories, or both) paired with natural language instructions describing the task being performed. This data trains policies that accept language commands as input and generate robot actions as output, enabling natural human-robot interaction where users tell robots what to do in plain language.
How does Claru ensure language quality and diversity?
Claru uses a multi-granularity annotation framework where tasks are described at goal, plan, and step levels by the contributors who perform the demonstrations. The annotation interface enforces structural consistency while allowing natural vocabulary diversity. Every annotation passes same-day human QA review. The global contributor network spanning multiple countries ensures cross-dialectal language variety.
Does Claru support annotations at multiple levels of task granularity?
Yes. Claru's structured activity taxonomy supports three annotation levels: goal-level (what to achieve), plan-level (sequence of sub-tasks), and step-level (individual motor actions). Each level is independently annotated and verified. This hierarchical structure maps directly to planning architectures used by language-conditioned policies like SayCan and Code as Policies.
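The three-level structure described above can be represented as nested records. A sketch with an invented schema (class and field names are illustrative, not Claru's actual format):

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    text: str  # atomic motor action, e.g. "close gripper on knife handle"

@dataclass
class PlanAnnotation:
    text: str   # one sub-task, e.g. "pick up the knife"
    steps: list # ordered list of StepAnnotation

@dataclass
class GoalAnnotation:
    text: str   # what to achieve, e.g. "make a sandwich"
    plan: list  # ordered list of PlanAnnotation

goal = GoalAnnotation(
    text="make a sandwich",
    plan=[PlanAnnotation(
        text="pick up the knife",
        steps=[StepAnnotation("move gripper above knife"),
               StepAnnotation("close gripper on knife handle")],
    )],
)

# A high-level planner consumes goal- and plan-level text, while a
# low-level policy conditions on step-level text.
assert goal.plan[0].steps[1].text.startswith("close gripper")
```

The value of keeping the levels separate is that each can be verified independently and routed to the layer of the policy stack that consumes it.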
How does language-conditioned data differ from standard VLA training data?
Standard VLA training data pairs visual observations with action labels, enabling a model to imitate demonstrated behaviors. Language-conditioned data adds a third modality: natural language instructions that specify which behavior to execute. This enables task-conditioned policies where a single model can perform different tasks based on language input, rather than requiring separate policies for each task.
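The difference shows up directly in the policy's call signature: one input versus two. A toy sketch (no real model; the rule-based bodies are placeholders for learned networks):

```python
def standard_policy(observation):
    """Standard imitation policy: action = f(observation).
    The task is baked into the weights, so each task needs its own policy."""
    return {"gripper": "close"}

def language_conditioned_policy(observation, instruction):
    """Language-conditioned policy: action = f(observation, instruction).
    The instruction selects the behavior at inference time."""
    if "open" in instruction:
        return {"gripper": "open"}
    return {"gripper": "close"}

obs = {"rgb": "frame"}
assert language_conditioned_policy(obs, "open the gripper")["gripper"] == "open"
assert language_conditioned_policy(obs, "grab the mug")["gripper"] == "close"
```

Because the instruction is an input rather than an architectural constant, one network can cover many tasks, which is what makes the paired instruction data indispensable.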
Your next hire isn't a vendor.
It's a data team.
Tell us what you're training. We'll scope the dataset.
References
- [1] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv, 2023. Web-scale vision-language pre-training improved robot policy generalization roughly threefold when fine-tuned on language-paired robot demonstrations.
- [2] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv, 2024. A 7B-parameter VLA outperformed the 55B RT-2-X by 16.5% on manipulation benchmarks through higher-quality language-paired demonstrations.
- [3] Ahn et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” arXiv, 2022. Demonstrated that grounding language instructions in robot affordances requires diverse language paired with demonstrations covering both successful and failed instruction-following attempts.
- [4] Mees et al. “CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation.” IEEE RA-L, 2022. Established a language-conditioned manipulation benchmark revealing that templated language annotations limit policy generalization to novel instructions.
- [5] Ghosh et al. “Octo: An Open-Source Generalist Robot Policy.” arXiv, 2024. Trained on 800,000 trajectories from 25 datasets; noted that missing language annotations in many datasets limited language-conditioning capability.
- [6] O'Neill et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv, 2024. Includes partial language annotations across 1M+ trajectories, with inconsistent conventions across contributing teams.