Training Data for RoboCat
Everything you need to know about RoboCat's data requirements — cross-embodiment seed demonstrations, self-improvement data loops, multi-robot format specifications, and how Claru delivers the high-quality initial data that RoboCat-style systems need to bootstrap autonomous improvement.
What Is RoboCat?
RoboCat is a self-improving foundation agent for robot manipulation developed by Google DeepMind, introduced in June 2023. The core innovation is a training loop where the model generates its own training data through autonomous practice, filters that data for successful task completions, and retrains on the expanded dataset — progressively improving without additional human demonstrations. RoboCat was the first system to demonstrate that a single vision-language-action model could autonomously improve across multiple robot embodiments and hundreds of tasks.
The system builds on the Gato architecture — a multimodal transformer that processes tokenized images, text, and actions in a unified sequence. RoboCat extends Gato with a structured self-improvement protocol: starting from a small seed dataset of human demonstrations (as few as 100 per task), the model performs autonomous rollouts in the real world, a success classifier filters the rollouts for task completions, and the successful trajectories are added to the training set. The model is then retrained from scratch on the expanded dataset. Each iteration of this loop yields 2-3x improvement in task success rate.
DeepMind demonstrated RoboCat across four different robot embodiments: a Sawyer arm, a Panda arm, a KUKA arm, and a custom bi-manual setup. The model was trained on 253 tasks spanning pick-and-place, stacking, insertion, lid manipulation, and more complex assembly operations. Critically, the cross-embodiment training was not just a curiosity — demonstrations from one robot platform measurably improved performance on other platforms, even when the embodiments had different kinematic structures, grippers, and camera configurations.
RoboCat's approach has important implications for data strategy. Because the model generates and curates its own data after the initial bootstrapping phase, the quality and diversity of the seed demonstrations are paramount. Poor seed data leads to a negative spiral in which the model's autonomous practice generates low-quality trajectories that further degrade performance. Conversely, high-quality seed demonstrations enable a virtuous cycle in which each iteration compounds improvements.
RoboCat Input/Output Specification
| Parameter | Specification |
|---|---|
| Observation Format | Multi-view RGB images (typically 2 cameras: overhead + wrist), tokenized via ViT and interleaved with proprioception tokens |
| Action Format | Continuous end-effector actions tokenized into 1024 discrete bins per dimension; variable DoF per embodiment (6-DoF + gripper for single-arm, higher for bi-manual) |
| Language Conditioning | Natural language task descriptions tokenized and prepended to the observation-action sequence (Gato-style) |
| Control Frequency | 5-10 Hz depending on embodiment (Sawyer at 5 Hz, KUKA at 10 Hz) |
| Proprioception | Joint positions and velocities included as tokenized input alongside visual observations |
| Episode Length | Variable; typically 100-500 timesteps per episode depending on task complexity |
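The 1024-bin action tokenization in the table can be sketched as uniform per-dimension binning over known action limits. This is a minimal illustration, assuming uniform bins; the actual RoboCat/Gato encoding may apply companding before discretization.

```python
import numpy as np

NUM_BINS = 1024  # bins per action dimension, per the table above

def discretize_action(action, low, high):
    """Map a continuous action vector to integer bin tokens.

    Uniform binning is an assumption for illustration; `low` and `high`
    are the per-dimension action limits for the embodiment.
    """
    action = np.asarray(action, dtype=np.float64)
    norm = (action - low) / (high - low)            # normalize to [0, 1]
    tokens = np.clip((norm * NUM_BINS).astype(int), 0, NUM_BINS - 1)
    return tokens

def undiscretize_action(tokens, low, high):
    """Inverse map: bin token -> bin-center continuous value."""
    centers = (np.asarray(tokens) + 0.5) / NUM_BINS
    return low + centers * (high - low)
```

At 1024 bins over a normalized range, the round-trip quantization error per dimension is at most half a bin width, which is why a shared discrete vocabulary can serve kinematically different embodiments.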
Architecture and Key Innovations
RoboCat's architecture is an extension of DeepMind's Gato model — a 1.2 billion parameter multimodal transformer that tokenizes all inputs and outputs into a single sequence. Visual observations (multi-view RGB images) are processed through a Vision Transformer (ViT) encoder that produces a fixed number of visual tokens per frame. These tokens are interleaved with tokenized proprioceptive state (joint positions and velocities), language tokens (the task description), and action tokens in an autoregressive sequence. The model is trained to predict the next action token given the preceding context.
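The interleaving described above can be sketched as follows. The token-group ordering within each timestep is an assumption for illustration; the exact serialization follows Gato's scheme.

```python
def build_episode_sequence(lang_tokens, frame_tokens, proprio_tokens, action_tokens):
    """Assemble one autoregressive training sequence, Gato-style.

    Language tokens form a prefix; each timestep then contributes its
    visual, proprioceptive, and action token groups in order. The
    per-timestep ordering here is illustrative.
    """
    seq = list(lang_tokens)                       # task description prefix
    for img, prop, act in zip(frame_tokens, proprio_tokens, action_tokens):
        seq += list(img)                          # ViT visual tokens
        seq += list(prop)                         # joint positions/velocities
        seq += list(act)                          # discretized action bins
    return seq
```

The model is then trained with a next-token objective over this sequence, with the loss applied to the action tokens.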
The key architectural innovation is not in the model itself but in the self-improvement training loop. The protocol works as follows: (1) Collect an initial seed dataset of 100+ human demonstrations per task. (2) Train the model on this seed data plus any existing cross-embodiment data. (3) Deploy the trained model for autonomous practice — the robot attempts the task repeatedly without human intervention. (4) Run a trained success classifier on the autonomous rollouts to identify successful completions. (5) Add successful rollouts to the training dataset. (6) Retrain the model from scratch on the expanded dataset. Steps 3-6 repeat for 5+ iterations.
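The six-step protocol above can be sketched as a loop. Here `train`, `collect_rollouts`, and `classify_success` are hypothetical callables standing in for real training, deployment, and classifier infrastructure; they are not RoboCat APIs.

```python
def self_improvement(seed_data, extra_data, train, collect_rollouts,
                     classify_success, iterations=5, rollouts_per_iter=1000):
    """Sketch of the RoboCat self-improvement protocol (steps 1-6).

    The three callables are hypothetical stand-ins:
      train(dataset)              -> model, retrained FROM SCRATCH
      collect_rollouts(model, n)  -> n autonomous practice rollouts
      classify_success(rollout)   -> bool from the success classifier
    """
    dataset = list(seed_data) + list(extra_data)   # steps 1-2
    model = train(dataset)
    for _ in range(iterations):
        rollouts = collect_rollouts(model, rollouts_per_iter)   # step 3
        kept = [r for r in rollouts if classify_success(r)]     # steps 4-5
        dataset = dataset + kept       # seed data is always retained
        model = train(dataset)         # step 6: retrain from scratch
    return model, dataset
```

Note that the dataset only ever grows and the seed demonstrations are never dropped, which is one of the safeguards against the loop degrading.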
Cross-embodiment transfer is achieved through the tokenized representation. Because all embodiments share the same tokenization scheme (actions are discretized into 1024 bins per dimension, regardless of the underlying hardware), the model learns a shared policy representation across platforms. The paper shows that training on data from all four embodiments jointly outperforms training on data from any single embodiment, even for that embodiment's own tasks. This suggests the model transfers not just visual understanding but also manipulation strategies across kinematically different robots.
The success classifier that filters autonomous rollouts is itself a critical component. DeepMind trained a separate vision-based classifier per task that evaluates the final state of each rollout. The accuracy of this classifier directly affects the quality of the self-generated data — false positives inject failed trajectories into the training set, while false negatives discard useful data. The paper reports classifier accuracies of 85-95% depending on task difficulty, with harder tasks (multi-step assembly) showing lower accuracy.
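The impact of classifier errors can be made concrete: given the true rollout success rate and the classifier's true- and false-positive rates, the fraction of accepted rollouts that are actually failures follows from Bayes' rule. A small illustrative helper (not from the paper):

```python
def accepted_data_contamination(success_rate, tpr, fpr):
    """Fraction of classifier-accepted rollouts that are actually failures.

    success_rate: true fraction of successful autonomous rollouts
    tpr: classifier true-positive rate (sensitivity)
    fpr: classifier false-positive rate (1 - specificity)
    """
    true_positives = success_rate * tpr
    false_positives = (1 - success_rate) * fpr
    return false_positives / (true_positives + false_positives)
```

For example, with a 30% true success rate, 90% TPR, and 10% FPR, roughly one in five accepted trajectories is a failure, enough to meaningfully contaminate the next training iteration.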
RoboCat vs Related Cross-Embodiment Models
| Feature | RoboCat (2023) | RT-1 (2022) | Octo (2024) | Gato (2022) |
|---|---|---|---|---|
| Self-Improvement | Yes (autonomous practice loop) | No | No | No |
| Embodiments | 4 (Sawyer, Panda, KUKA, bi-manual) | 1 (Everyday Robot) | 22+ (Open X-Embodiment) | 1 (Sawyer) + sim + games |
| Architecture | Gato (1.2B multimodal transformer) | EfficientNet + Transformer (35M) | Custom transformer (93M) | Multimodal transformer (1.2B) |
| Action Representation | 1024-bin discrete tokens | 11-bin discrete tokens | Continuous (diffusion head) | 1024-bin discrete tokens |
| Seed Data Per Task | 100 demonstrations | ~175 avg demonstrations | Variable (Open X-Embodiment mix) | Not specified per task |
| Open Source | No | No (data format is open) | Yes | No |
Training Data Requirements
RoboCat's data requirements are unique because of the self-improvement loop. The model's training dataset grows over time, but the initial seed data determines whether the self-improvement process converges to high performance or collapses. The paper uses 100 human demonstrations per task as the minimum seed, collected via teleoperation on each of the four robot platforms. Each demonstration consists of multi-view RGB video (overhead and wrist cameras), proprioceptive state at 5-10 Hz, end-effector actions, and a natural language task description.
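The per-demonstration fields described above can be captured in a simple record type. This schema is hypothetical; the field names and shapes are illustrative, not a RoboCat file format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SeedEpisode:
    """Illustrative schema for one seed demonstration (names assumed)."""
    embodiment: str            # e.g. "sawyer", "kuka"
    task_description: str      # natural language conditioning string
    overhead_rgb: np.ndarray   # (T, H, W, 3) uint8
    wrist_rgb: np.ndarray      # (T, H, W, 3) uint8
    proprio: np.ndarray        # (T, num_joints * 2) positions + velocities
    actions: np.ndarray        # (T, dof) end-effector actions
    control_hz: float          # 5-10 Hz depending on embodiment
    success: bool              # must be True for every seed episode
```

Keeping both camera streams, proprioception, and actions time-aligned at the embodiment's control frequency in a single record makes the later tokenization step straightforward.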
Seed demonstration quality is the single most important factor. The 100 demonstrations must be: (1) Successful — every trajectory should complete the task to the goal state, with no failed or partial attempts in the seed set. (2) Diverse — the demonstrations should cover the natural variation in object positions, orientations, sizes, and colors that the robot will encounter during autonomous practice. (3) Smooth — trajectories should be continuous and efficient, without unnecessary pauses, restarts, or jittery movements that would confuse the model. (4) Consistently formatted — all demonstrations for a given embodiment must share the same camera viewpoints, action space normalization, and control frequency.
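Criteria (1), (3), and (4) can be checked automatically per episode; diversity, criterion (2), is a dataset-level property that must be assessed across episodes. A minimal per-episode sketch, with hypothetical field names and thresholds:

```python
import numpy as np

def check_seed_episode(episode, expected_hz, max_jerk=0.1):
    """Illustrative quality gate for the per-episode seed criteria.

    `episode` is a dict with 'success' (bool), 'actions' (T x D array),
    and 'timestamps' (seconds). Field names and thresholds are assumed,
    not a RoboCat specification.
    """
    issues = []
    if not episode["success"]:                       # criterion 1
        issues.append("not a successful completion")
    dt = np.diff(episode["timestamps"])              # criterion 4
    hz = 1.0 / float(np.median(dt))
    if abs(hz - expected_hz) > 0.5:
        issues.append(f"control rate {hz:.1f} Hz, expected {expected_hz} Hz")
    vel = np.diff(episode["actions"], axis=0)        # criterion 3
    acc = np.diff(vel, axis=0)
    if acc.size and float(np.abs(acc).max()) > max_jerk:
        issues.append("jittery trajectory: large acceleration spikes")
    return issues
```

An empty issue list means the episode passes the automated checks; any flagged episode should be excluded from the seed set rather than repaired.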
For cross-embodiment training, the paper shows that pooling data across all four platforms — even with different kinematic structures — consistently improves per-platform performance. This means teams building RoboCat-style systems should collect seed data from as many robot platforms as possible, even if the end goal is deployment on a single platform. The shared visual and language understanding transfers across embodiments, providing a stronger starting point for the self-improvement iterations.
After the initial seed phase, the success classifier becomes a critical data quality gate. The classifier itself requires training data: typically 500-1,000 labeled rollout outcomes (success/failure) per task, annotated by human operators. This annotation cost is often overlooked but is essential for the self-improvement loop to function. Poor classifier accuracy leads to either dataset poisoning (false positives) or data starvation (false negatives), both of which degrade the improvement loop.
How Claru Data Integrates with RoboCat
Claru provides the high-quality seed demonstrations that RoboCat-style self-improving systems require to bootstrap the autonomous improvement loop. Our operators are trained specifically for seed-quality collection: every trajectory is a successful completion, covers the full diversity of object configurations, follows smooth and efficient motion profiles, and is recorded with hardware-synchronized multi-view cameras at the control frequency your embodiment requires (5-10 Hz for RoboCat-compatible pipelines).
Beyond seed demonstrations, Claru provides the human annotations needed to train the success classifier that gates the self-improvement loop. Our annotation team labels rollout outcomes (success/failure) with task-specific criteria, achieving inter-annotator agreement rates above 95%. We can also provide graded success labels (partial completion scores) for tasks where binary success/failure is too coarse — for example, multi-step assembly tasks where the robot may complete 3 of 5 steps.
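Raw percent agreement can overstate label quality when one outcome dominates; Cohen's kappa corrects for chance agreement. A minimal sketch for two annotators' binary success labels, illustrative of how agreement figures like the one above can be audited:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' 0/1 labels."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n          # annotator A's positive-label rate
    p_b = sum(labels_b) / n          # annotator B's positive-label rate
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_chance == 1.0:              # degenerate: agreement forced by base rates
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)
```

Kappa of 1.0 indicates perfect agreement; values near 0 indicate agreement no better than chance, even if raw percent agreement looks high.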
For teams building cross-embodiment datasets, Claru collects demonstrations across multiple robot platforms with consistent data formatting. We normalize camera placement, resolution, and frame rate across embodiments while preserving each platform's native action space — matching RoboCat's requirement for hardware-specific action spaces within a shared visual/language representation. All datasets include full provenance documentation: operator ID, collection session, environment configuration, and per-episode quality scores.
Key References
- [1] Bousmalis, K., Vinyals, O., Lever, G., et al. “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation.” arXiv:2306.11706, 2023.
- [2] Reed, S., Zolna, K., Parisotto, E., et al. “A Generalist Agent.” arXiv:2205.06175, 2022.
- [3] Brohan, A., Brown, N., Carbajal, J., et al. “RT-1: Robotics Transformer for Real-World Control at Scale.” RSS 2023.
- [4] Ghosh, D., Walke, H., Pertsch, K., Black, K., et al. “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213, 2024.
- [5] O'Neill, J., Rehman, T., Gupta, A., et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA 2024.
Frequently Asked Questions
Are 100 demonstrations per task really enough to train RoboCat?
The 100 demonstrations are just the seed — RoboCat generates thousands more through autonomous practice. The self-improvement loop multiplies the effective training set by 5-10x per iteration over 5 iterations. However, the quality bar for those 100 seed demonstrations is extremely high: every trajectory must be a clean, successful completion with diverse object configurations. Poor seed data causes the self-improvement loop to diverge rather than improve. This is why Claru's operators are specifically trained for seed-quality collection.
How does the self-improvement loop actually work?
After training on seed data, the robot performs autonomous rollouts (attempting tasks without human guidance). A separate success classifier evaluates each rollout's final state to determine if the task was completed. Successful rollouts are added to the training set, and the model retrains from scratch. The classifier itself needs 500-1,000 labeled examples (success/failure) per task — this is human annotation of rollout outcomes, which Claru provides as part of the data pipeline.
Can I use RoboCat's approach with only one robot platform?
Yes, but you will get worse results than with multi-platform data. The RoboCat paper explicitly shows that cross-embodiment training improves performance on every individual platform. If you only have one robot, you can still benefit from the self-improvement loop, but the initial policy will be weaker and may require more seed demonstrations (200-500 instead of 100) to bootstrap reliably. Alternatively, you can include publicly available data from other platforms in your training mix.
Can the self-improvement loop make the model worse over time?
Yes, this is a real risk called 'negative self-play' or 'mode collapse.' If the success classifier has low accuracy (below ~85%), false positives inject failed trajectories into the training set, which degrades model performance in the next iteration, producing even worse rollouts. The paper addresses this by: (1) retraining from scratch each iteration (not fine-tuning, which would compound errors), (2) maintaining the original seed data in every training run, and (3) using high-accuracy classifiers trained on substantial labeled data. If you see performance decreasing across iterations, the first thing to audit is classifier accuracy.
What hardware and infrastructure do I need to run a RoboCat-style pipeline?
You need: (1) a robot with at least two cameras (overhead + wrist/side), (2) a teleoperation interface capable of recording at 5-10 Hz (SpaceMouse, VR controllers, or leader-follower arms), (3) sufficient compute for training a ~1B parameter transformer (8+ A100 GPUs or equivalent), and (4) the ability to run the robot autonomously for extended periods for the practice rollouts. The autonomous practice phase is the biggest logistical hurdle — you need unattended robot operation for hundreds of hours, which requires safety systems, automatic reset mechanisms, and reliable object placement.
Get Seed Data for RoboCat-Style Self-Improvement
Tell us about your robot platforms and target tasks. Claru delivers the high-quality seed demonstrations and classifier labels that bootstrap RoboCat-style autonomous improvement loops.