Training Data for Google DeepMind Robotics

DeepMind defined the VLA paradigm with RT-1 and RT-2. Here is how diverse real-world data fuels the next generation of robot foundation models.

About Google DeepMind Robotics

Google DeepMind's robotics division created RT-1, RT-2, and RT-X — the foundational models behind the vision-language-action paradigm that sparked the current wave of robot foundation model research. Its work on the Open X-Embodiment dataset established the first large-scale cross-embodiment robot learning benchmark.

Vision-language-action models for robots · Cross-embodiment robot learning · Internet-scale pretraining for robot control · Large-scale robot data collection and curation · Sim-to-real transfer with generative models

DeepMind Robotics at a Glance

Flagship VLA model: RT-2
X-Embodiment robots: 22
RT-1 demonstrations: 130K+
Contributing labs: 21
Total robot episodes: 1M+

Known Data Requirements

DeepMind Robotics pioneered the data-scaling approach with RT-2, demonstrating that VLMs pretrained on internet data can be fine-tuned for robot control. Its ongoing work requires ever-larger robot manipulation datasets, broader embodiment coverage for RT-X successors, and validation data from real deployment settings to prove that scaled pretraining translates to robust performance outside the lab.

Diverse manipulation data for VLA model scaling

Source: RT-2 paper (Brohan et al., 2023) and subsequent scaling research

Massive quantities of robot manipulation demonstrations spanning hundreds of objects, tasks, and environments — the data fuel for next-generation VLA models.

Cross-embodiment data beyond Open X-Embodiment

Source: Open X-Embodiment dataset gaps and RT-X scaling requirements

Manipulation recordings from embodiments underrepresented in Open X-Embodiment — humanoids, mobile manipulators, dexterous hands — to improve cross-robot generalization.

Real-world deployment validation data

Source: Gap between lab benchmarks and commercial deployment requirements

Authentic data from target deployment environments — kitchens, offices, retail — to validate that lab-trained models work in real-world conditions.

Language-grounded task data at scale

Source: RT-2's language-conditioned action generation architecture

Manipulation demonstrations paired with diverse natural language descriptions — including paraphrases, implicit instructions, and multi-step task decompositions — to expand the language grounding capabilities of successor VLA models.
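
To make the pairing concrete, a minimal sketch of what such a record could look like is shown below; the field names and instruction categories are illustrative assumptions, not an actual DeepMind or Claru schema.

    # Hypothetical record format pairing one demonstration with several
    # instruction variants; field names are illustrative, not a real schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class InstructionVariant:
        text: str    # e.g. "put the apple in the bowl"
        style: str   # "direct", "paraphrase", "implicit", or "decomposed"

    @dataclass
    class GroundedDemo:
        episode_id: str
        instructions: List[InstructionVariant] = field(default_factory=list)
        substeps: List[str] = field(default_factory=list)  # ordered sub-task captions

    demo = GroundedDemo(
        episode_id="ep_000123",
        instructions=[
            InstructionVariant("put the apple in the bowl", "direct"),
            InstructionVariant("the fruit goes in the dish", "paraphrase"),
            InstructionVariant("tidy up the counter", "implicit"),
        ],
        substeps=["locate apple", "grasp apple", "move over bowl", "release"],
    )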

Long-horizon multi-step task recordings

Source: SayCan and Inner Monologue research on task planning

Complete recordings of multi-step tasks lasting minutes rather than seconds — kitchen meal preparation, workspace tidying, package assembly — where planning and error recovery are as important as individual manipulation primitives.
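
A rough sketch of how such a recording might be segmented is shown below; the fields, including the error-recovery flag, are assumptions made for illustration rather than a published annotation format.

    # Illustrative step segmentation for a long-horizon recording; the field
    # names are assumptions for this sketch, not a real annotation spec.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StepSegment:
        label: str                 # e.g. "chop vegetables"
        start_s: float             # segment start, seconds from episode start
        end_s: float               # segment end
        is_recovery: bool = False  # True if this step corrects an earlier failure

    def recovery_ratio(steps: List[StepSegment]) -> float:
        """Fraction of episode time spent on error-recovery behavior."""
        total = sum(s.end_s - s.start_s for s in steps)
        recovery = sum(s.end_s - s.start_s for s in steps if s.is_recovery)
        return recovery / total if total > 0 else 0.0

    episode = [
        StepSegment("open drawer", 0.0, 6.5),
        StepSegment("retrieve knife", 6.5, 14.0),
        StepSegment("re-grasp dropped knife", 14.0, 21.0, is_recovery=True),
        StepSegment("chop vegetables", 21.0, 95.0),
    ]
    print(f"recovery time: {recovery_ratio(episode):.1%}")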

How Claru Data Addresses These Needs

Lab need: Diverse manipulation data for VLA model scaling
Claru offering: Manipulation Trajectory Dataset + Egocentric Activity Dataset
Rationale: Claru's combined manipulation and egocentric datasets provide millions of annotated clips showing physical interactions — a curated alternative to raw internet scraping for VLA pretraining.

Lab need: Cross-embodiment data beyond Open X-Embodiment
Claru offering: Custom Multi-Embodiment Collection Campaigns
Rationale: Claru can coordinate data collection across different robot platforms to fill specific coverage gaps in DeepMind's cross-embodiment training distribution.

Lab need: Real-world deployment validation data
Claru offering: Custom Environment-Specific Collection
Rationale: Claru's global collector network can gather data in real deployment target environments — actual kitchens, real offices, operating retail stores — providing the authentic validation data labs cannot generate internally.

Lab need: Long-horizon multi-step task recordings
Claru offering: Egocentric Activity Dataset (~386K clips) + Custom Long-Horizon Collection
Rationale: Claru's egocentric video captures complete multi-step activities in real households and workplaces, with temporal annotations that segment individual steps within longer task sequences.

Technical Data Analysis

Google DeepMind Robotics defined the current paradigm of robot learning with their RT series of papers. RT-1 demonstrated that a Transformer trained on large-scale robot data could generalize across tasks and objects. RT-2 showed that pretraining on internet-scale vision-language data dramatically improves robot task understanding. RT-X and Open X-Embodiment proved that training on data from multiple robot embodiments enables cross-robot transfer.
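
At a high level, the control loop these models implement is simple even though the models themselves are not. The sketch below illustrates it with `policy`, `camera`, and `robot` as placeholder interfaces rather than real APIs.

    # Minimal sketch of a vision-language-action control loop. The `policy`,
    # `camera`, and `robot` objects are placeholders, not real APIs; RT-style
    # models additionally decode the action from discrete output tokens.
    import numpy as np

    def run_episode(policy, camera, robot, instruction: str, max_steps: int = 200):
        """Run one language-conditioned episode at a fixed control rate."""
        for _ in range(max_steps):
            image = camera.read()                        # RGB observation (H, W, 3)
            action = policy.predict(image, instruction)  # e.g. end-effector delta + gripper
            robot.apply(np.asarray(action, dtype=float)) # execute one control step
            if robot.task_done():                        # stop when the task is complete
                break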

Each successive paper required more data than the last. RT-1 used 130K demonstrations collected on a single robot platform. RT-2 added internet-scale VLM pretraining. Open X-Embodiment aggregated data from 22 robot embodiments across 21 institutions. The trajectory is clear: next-generation models will need even larger, more diverse datasets.

The Open X-Embodiment dataset, while groundbreaking, has known limitations. It is heavily biased toward tabletop manipulation with single-arm robots in laboratory settings. Humanoid data, mobile manipulation data, and dexterous hand data are underrepresented. Outdoor environments, industrial settings, and domestic spaces are scarce. Filling these gaps requires systematic data collection effort beyond what academic labs can contribute organically.
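
One way to make such gaps visible is to tabulate episode counts by embodiment and setting across the pooled data. The sketch below does this over hypothetical metadata records; the field names and category labels are assumptions for illustration.

    # Sketch: surface coverage gaps in a pooled robot dataset by counting
    # episodes per (embodiment, environment) pair. Metadata is hypothetical.
    from collections import Counter

    episodes = [
        {"embodiment": "single_arm", "environment": "lab_tabletop"},
        {"embodiment": "single_arm", "environment": "lab_tabletop"},
        {"embodiment": "mobile_manipulator", "environment": "office"},
        {"embodiment": "humanoid", "environment": "kitchen"},
    ]

    counts = Counter((e["embodiment"], e["environment"]) for e in episodes)
    total = sum(counts.values())
    for (embodiment, environment), n in counts.most_common():
        print(f"{embodiment:20s} {environment:15s} {n:6d}  ({n / total:.0%})")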

DeepMind's research trajectory also reveals a growing gap between laboratory performance and real-world deployment. Models that achieve impressive success rates on lab benchmarks often fail in authentic environments where lighting, clutter, object variety, and human interference differ from training conditions. Closing this gap requires validation data collected in real deployment environments — something Claru's distributed collection network is uniquely positioned to provide.

The SayCan and Inner Monologue line of work demonstrates that long-horizon task planning is the next frontier after single-step manipulation. These systems use language models to decompose complex instructions into executable robot actions, but they need training data that captures the full arc of multi-step tasks — including the error detection and recovery behaviors that are absent from single-step demonstration datasets. Kitchen meal preparation, workspace organization, and package assembly are examples of tasks where planning matters as much as manipulation skill.
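
The selection rule at the core of SayCan can be summarized compactly: each candidate skill is scored by a language model for relevance to the instruction and by an affordance model for feasibility in the current state, and the product of the two picks the next skill. The sketch below uses hard-coded stand-in scores rather than real model calls.

    # Toy sketch of a SayCan-style step selector; the relevance and
    # feasibility numbers are stand-ins, not outputs of real models.
    def select_next_skill(instruction, skills, llm_score, affordance_score):
        """Pick the skill with the highest relevance x feasibility product."""
        scored = {s: llm_score(instruction, s) * affordance_score(s) for s in skills}
        return max(scored, key=scored.get)

    # Stand-in scores for one planning step of "clean up the spill".
    RELEVANCE = {"find a sponge": 0.7, "pick up the sponge": 0.2,
                 "wipe the counter": 0.1, "done": 0.0}
    FEASIBILITY = {"find a sponge": 0.9, "pick up the sponge": 0.3,
                   "wipe the counter": 0.2, "done": 1.0}

    next_skill = select_next_skill(
        "clean up the spill",
        list(RELEVANCE),
        llm_score=lambda instruction, skill: RELEVANCE[skill],
        affordance_score=lambda skill: FEASIBILITY[skill],
    )
    print(next_skill)  # -> "find a sponge"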

Key Research & References

  1. Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL, 2023.
  2. Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA, 2024.
  3. Brohan et al. “RT-1: Robotics Transformer for Real-World Control at Scale.” RSS, 2023.
  4. Ahn et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” CoRL, 2022.
  5. Huang et al. “Inner Monologue: Embodied Reasoning through Planning with Language Models.” CoRL, 2022.
  6. Bousmalis et al. “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation.” arXiv:2306.11706, 2023.

Frequently Asked Questions

What data does a vision-language-action model need?

A VLA model takes visual observations and language instructions as input and outputs robot actions. It needs three types of data: internet-scale vision-language data for pretraining, robot manipulation demonstrations for action learning, and diverse environment data for generalization. Each component requires massive scale.

What are the known gaps in the Open X-Embodiment dataset?

Open X-Embodiment is biased toward tabletop manipulation with single-arm robots in laboratory settings. Humanoid data, mobile manipulation, dexterous hand data, outdoor environments, and domestic spaces are underrepresented. Next-generation models need to fill these coverage gaps with purpose-collected data.

Why do lab-trained models struggle in real deployments?

Lab environments have controlled lighting, limited clutter, and standardized objects. Real deployment environments have variable lighting, unexpected objects, human interference, and environmental conditions labs cannot replicate. Validation data from real target environments is essential to measure and close this gap.

How does RT-2 transfer web knowledge to robot control?

RT-2 co-fine-tunes a vision-language model to output robot actions alongside text. This means the robot inherits the language model's understanding of concepts, spatial relationships, and object properties from internet pretraining. A robot can follow instructions involving objects or actions it has never seen in robot training data, because the semantic understanding transfers from the VLM.
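
A simplified illustration of the "actions as tokens" idea is below: each continuous action dimension is discretized into a fixed number of bins so it can be emitted in the model's output vocabulary. The bin count and value ranges here are illustrative, not RT-2's exact tokenizer.

    # Simplified sketch of action discretization in the spirit of RT-2: map
    # each continuous action dimension to an integer bin (and back). The bin
    # count and ranges are illustrative, not the paper's exact scheme.
    import numpy as np

    N_BINS = 256

    def to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
        normalized = (action - low) / (high - low)             # scale to [0, 1]
        return np.clip((normalized * (N_BINS - 1)).round(), 0, N_BINS - 1).astype(int)

    def from_tokens(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
        return low + (tokens / (N_BINS - 1)) * (high - low)    # back to continuous values

    low, high = np.array([-0.1, -0.1, -0.1]), np.array([0.1, 0.1, 0.1])
    action = np.array([0.03, -0.05, 0.0])                      # e.g. an end-effector delta
    tokens = to_tokens(action, low, high)
    print(tokens, from_tokens(tokens, low, high))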

Can robots generate their own training data?

RoboCat demonstrated that robots can generate their own training data by attempting tasks and learning from successes. However, this self-improvement loop still requires high-quality human demonstration data as a seed — the robot needs initial examples to bootstrap from. Purpose-collected data from diverse environments provides the seed quality and diversity that makes self-improvement effective.
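
Conceptually, the loop looks like the sketch below: seed with human demonstrations, let the policy attempt tasks, keep the successes, retrain, and repeat. The `train` and `rollout` functions are caller-supplied placeholders, not RoboCat's actual implementation.

    # Conceptual sketch of a RoboCat-style self-improvement loop. `train` and
    # `rollout` are placeholders supplied by the caller in this sketch.
    def self_improve(policy, seed_demos, tasks, train, rollout,
                     rounds: int = 3, attempts_per_task: int = 10):
        data = list(seed_demos)                        # human demonstrations as the seed
        for _ in range(rounds):
            policy = train(policy, data)               # refit on everything gathered so far
            for task in tasks:
                for _ in range(attempts_per_task):
                    trajectory, success = rollout(policy, task)
                    if success:                        # keep only successful attempts
                        data.append(trajectory)
        return policy, data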

Scale Robot Foundation Model Data

Get in touch to discuss diverse, curated datasets for next-generation VLA and robot foundation model research.