Training Data for RT-2
Everything you need to know about RT-2's data requirements — observation formats, action tokenization into 256-bin text tokens, volume benchmarks from 130K real-robot demonstrations, and how Claru delivers RT-2-ready datasets for VLA fine-tuning and evaluation.
Input/Output Specification
- Observations: 320x320 RGB images (single head-mounted camera)
- Actions: 7-DoF end-effector deltas + gripper, each discretized into 256 bins and emitted as text tokens
- Language conditioning: free-form natural language instructions
- Control frequency: 3 Hz
What Is RT-2?
RT-2 (Robotic Transformer 2) is a vision-language-action (VLA) model from Google DeepMind that demonstrated, for the first time, that co-training a large vision-language model on both internet-scale image-text data and robot demonstration data produces emergent robotic capabilities not present in either data source alone. Published in July 2023 by Anthony Brohan and over 50 co-authors, RT-2 showed that a robot could interpret novel commands, reason about unseen objects, and perform multi-step semantic inference -- all without those behaviors being explicitly programmed or demonstrated in robot training data.
The core insight behind RT-2 is treating robot actions as text tokens. Instead of building a separate action prediction head, RT-2 discretizes each of the 7 degrees of freedom in the end-effector action space into 256 bins and represents them as integer strings within the VLM vocabulary. A single output from the model might look like '1 128 91 241 5 101 127', where each number is a bin index for one action dimension. This allows the model to leverage the full reasoning capacity of its language backbone -- PaLI-X at 55 billion parameters or PaLM-E at 12 billion parameters -- to generate actions conditioned on visual observations and natural language instructions.
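The discretization step can be sketched in a few lines. This is a minimal illustration of uniform 256-bin action tokenization, not RT-2's released code; the per-dimension bounds below are hypothetical placeholders, whereas RT-2 derives them from the observed data range of each action dimension.

```python
# Hypothetical per-dimension bounds (illustrative only); RT-2 computes
# these from the observed range of each action dimension in the dataset.
ACTION_LOW  = [-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]
ACTION_HIGH = [ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0]
NUM_BINS = 256

def action_to_tokens(action):
    """Uniformly discretize a 7-DoF end-effector action into 256 bins per
    dimension and format the bin indices as one space-separated string."""
    bins = []
    for a, lo, hi in zip(action, ACTION_LOW, ACTION_HIGH):
        a = min(max(a, lo), hi)                     # clip to the data range
        idx = int((a - lo) / (hi - lo) * NUM_BINS)  # uniform binning
        bins.append(min(idx, NUM_BINS - 1))         # map the top edge to the last bin
    return " ".join(str(b) for b in bins)
```

A zero action (no motion, closed gripper) lands in the middle bin for the symmetric delta dimensions, which is exactly the kind of string the language backbone learns to emit.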
In evaluation across over 6,000 real-robot trials, RT-2 achieved 62% success on tasks involving previously unseen objects (compared to 32% for RT-1), and could execute semantically complex commands like 'move the apple to the plate that is the same color' or 'pick up the object that could be used as a hammer.' These emergent capabilities arise entirely from web-scale pretraining transfer and were never demonstrated in the robot training data.
RT-2 is not open-source, but its architecture has directly inspired open alternatives including OpenVLA and Octo. Understanding RT-2's data pipeline is essential for any team building VLA systems, as the action tokenization scheme and co-training methodology it established have become standard practice in the field.
RT-2 Key Metrics
RT-2 Input/Output Specification
| Parameter | Specification |
|---|---|
| Observation Format | 320x320 RGB images from a single head-mounted camera on the Everyday Robot |
| Action Format | 7-DoF end-effector deltas (x, y, z position + x, y, z rotation + gripper), each discretized into 256 bins and represented as text tokens |
| Language Conditioning | Free-form natural language instructions (e.g., 'pick up the red can') |
| Control Frequency | 3 Hz action prediction rate |
| Action Horizon | Single-step action prediction per inference call |
| Episode Length | Typically 20-40 steps (7-13 seconds at 3 Hz) |
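The specification above maps naturally onto a simple episode container. This is a sketch of one plausible in-memory layout, with field names invented for illustration; actual RT-2 data is stored in RLDS, not in these classes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RT2Step:
    """One timestep of an RT-2-style episode, per the I/O spec above.
    Field names are illustrative, not from any RT-2 release."""
    image: bytes            # 320x320 RGB frame, encoded
    instruction: str        # free-form language, constant over the episode
    action: List[float]     # 7-DoF deltas: xyz position + xyz rotation + gripper

@dataclass
class RT2Episode:
    steps: List[RT2Step] = field(default_factory=list)

    def duration_seconds(self, hz: float = 3.0) -> float:
        """Wall-clock length implied by the 3 Hz control rate."""
        return len(self.steps) / hz
```

A typical 30-step episode works out to 10 seconds, consistent with the 20-40 step range above.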
Architecture and Key Innovations
RT-2 builds on two VLM backbones: PaLI-X (55B parameters) and PaLM-E (12B parameters). Both were originally trained for visual question answering, image captioning, and other vision-language tasks on internet-scale datasets. The RT-2 training procedure co-fine-tunes these backbones on a mixture of their original web data and robot demonstration data, ensuring the model retains its language understanding while acquiring robotic control capabilities.
The action tokenization mechanism is RT-2's most influential contribution. Each action dimension is uniformly discretized into 256 bins spanning the observed data range. The resulting integer tokens are formatted as a string (e.g., '1 128 91 241 5 101 127 217' for 7-DoF plus a terminate/continue flag) and appended to the training corpus alongside standard VQA-style question-answer pairs. At inference time, the model receives a camera image and a language instruction, and autoregressively generates 8 tokens that decode into a physical robot action.
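The inverse mapping, from generated tokens back to a physical command, can be sketched as follows. The bounds are hypothetical placeholders (real pipelines use dataset-derived normalization statistics), and the placement of the terminate/continue flag as the final token follows the "7-DoF plus a terminate/continue flag" phrasing above rather than any released specification.

```python
# Hypothetical bounds mirroring the tokenization scheme; actual RT-2
# normalization statistics are derived from the training dataset.
ACTION_LOW  = [-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]
ACTION_HIGH = [ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0]
NUM_BINS = 256

def tokens_to_action(token_str):
    """Decode a generated string like '1 128 91 241 5 101 127 217' into a
    (terminate, action) pair. The first 7 tokens are bin indices mapped
    back to continuous deltas via their bin centers; the last token is
    read here as the terminate flag (assumption: bin >= 128 terminates)."""
    bins = [int(t) for t in token_str.split()]
    action = [
        lo + (b + 0.5) / NUM_BINS * (hi - lo)   # bin index -> bin center
        for b, lo, hi in zip(bins[:7], ACTION_LOW, ACTION_HIGH)
    ]
    terminate = bins[7] >= NUM_BINS // 2
    return terminate, action
```

Decoding to bin centers rather than bin edges halves the worst-case quantization error at no extra cost.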
The co-training strategy mixes robot data at roughly a 50/50 ratio with web VLM data during fine-tuning. The authors found this ratio critical: too much robot data degrades the model's language reasoning, while too little fails to ground actions in the physical world. The PaLI-X 55B variant consistently outperformed the 12B PaLM-E variant, suggesting that scale in the language backbone directly translates to better robotic generalization.
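The mixing strategy can be illustrated with a toy batch sampler. This is a simplification under assumed names: production co-training pipelines weight whole datasets inside a streaming sampler rather than drawing items one at a time.

```python
import random

def cotrain_batch(robot_data, web_data, batch_size, robot_ratio=0.5, rng=None):
    """Sample one co-training batch that mixes robot demonstrations with
    web VLM examples at a fixed ratio (~50/50 per the RT-2 recipe).
    A sketch only; names and signature are illustrative."""
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        source = robot_data if rng.random() < robot_ratio else web_data
        batch.append(rng.choice(source))
    return batch
```

Holding `robot_ratio` near 0.5 reflects the trade-off described above: push it toward 1.0 and language reasoning degrades, push it toward 0.0 and actions lose physical grounding.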
A key architectural difference from RT-1 is the elimination of a dedicated robot-specific network. RT-1 used a FiLM-conditioned EfficientNet architecture trained exclusively on robot data. RT-2 instead repurposes the general-purpose VLM, demonstrating that robot control can be treated as a special case of vision-language understanding. This paradigm shift opened the door to VLA architectures now used across the field.
RT-2 vs Related Models
| Dimension | RT-1 | RT-2 (PaLI-X) | RT-2 (PaLM-E) | OpenVLA |
|---|---|---|---|---|
| Parameters | 35M | 55B | 12B | 7B |
| Robot Training Data | 130K demos | 130K demos | 130K demos | 970K demos (OXE) |
| Web Data Co-Training | None | Yes (VLM data) | Yes (VLM data) | Yes (Prismatic VLM) |
| Action Representation | Discrete (custom head) | Text tokens (256 bins) | Text tokens (256 bins) | Text tokens (256 bins) |
| Novel Object Generalization | 32% | 62% | ~50% | Varies by embodiment |
| Open Source | Yes | No | No | Yes |
Training Data Requirements
RT-2's robot demonstration corpus contains approximately 130,000 episodes collected over 17 months using a fleet of 13 mobile manipulators (Everyday Robots) in a real office kitchen environment. Each episode consists of a sequence of 320x320 RGB image observations paired with 7-DoF end-effector delta actions at 3 Hz. The episodes cover over 700 task descriptions involving tabletop manipulation: picking, placing, moving objects near/on/into other objects, opening and closing drawers, and wiping surfaces.
The web-scale data component is equally important. For the PaLI-X variant, this includes billions of image-text pairs from internet sources used during original VLM pretraining. The co-fine-tuning stage then trains on a roughly equal mixture of robot episodes (formatted as 'instruction, image -> action tokens') and VLM tasks (formatted as 'question, image -> text answer'). This dual data stream is what enables emergent reasoning: the web data teaches the model about objects, spatial relationships, and language semantics, while the robot data teaches it to map these concepts to physical actions.
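Because both streams reduce to (text input, text target) pairs, they can share one training interface. The prompt wording below is illustrative, not RT-2's exact template; the image is supplied to the VLM separately from the text.

```python
def format_robot_sample(instruction, action_tokens):
    """Render a robot step in the 'instruction, image -> action tokens'
    scheme. Prompt phrasing is a hypothetical stand-in for the real template."""
    prompt = f"What action should the robot take to {instruction}?"
    return (prompt, action_tokens)

def format_vqa_sample(question, answer):
    """A web VLM sample uses the same (input, target) interface, so both
    streams interleave into one corpus with no model changes."""
    return (question, answer)
```

This shared interface is the practical payoff of action-as-text: the action head, loss function, and decoding loop are literally the same ones used for answering questions about images.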
For teams building RT-2-style systems, the data requirements extend beyond raw demonstration count. Each demonstration needs precise temporal alignment between camera frames and action labels, consistent 256-bin action normalization across the dataset, and diverse language instructions that match the free-form conditioning the model will receive at deployment. The original RT-2 dataset included multiple language templates per task (e.g., 'pick up the can', 'grab the soda can', 'take the red can') to improve instruction following robustness.
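A basic version of the temporal-alignment check reads as follows. The skew threshold is an illustrative assumption; production validation would also cover dropped frames, action range consistency, and instruction diversity.

```python
def check_alignment(frame_timestamps, action_timestamps, max_skew_s=0.05):
    """Return True when every camera frame has an action label whose
    timestamp is within max_skew_s seconds of it. Threshold of 50 ms is
    an illustrative choice, not a published RT-2 tolerance."""
    if len(frame_timestamps) != len(action_timestamps):
        return False  # dropped frames or labels: episode fails outright
    return all(abs(f - a) <= max_skew_s
               for f, a in zip(frame_timestamps, action_timestamps))
```

At 3 Hz control, even a one-frame misalignment shifts the label by ~333 ms, so catching skew at collection time is far cheaper than debugging a policy trained on it.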
Scaling laws from the RT-2 paper suggest that co-training on more robot data improves novel-object generalization roughly logarithmically. Moving from 10K to 130K demonstrations approximately doubled the emergent capability score. For fine-tuning RT-2-style architectures on a new embodiment, the community has found that 10,000-50,000 demonstrations across diverse tasks and environments are needed for strong zero-shot generalization, while 1,000-5,000 demonstrations suffice for single-task specialization.
How Claru Data Integrates with RT-2
While RT-2 itself is not publicly released, Claru provides datasets fully compatible with RT-2-style VLA architectures -- including OpenVLA, Octo, and custom PaLI-X fine-tuning pipelines. Our data delivery pipeline produces demonstrations at 320x320 resolution with free-form language instruction annotations and 7-DoF end-effector action labels discretized into the standard 256-bin tokenization scheme that RT-2 established.
Claru's data collection methodology directly addresses the scaling bottleneck that limited the original RT-2 work. Where Google used 13 robots in a single kitchen over 17 months, Claru operates collection campaigns across 100+ cities with diverse home, office, warehouse, and retail environments. This environmental diversity is precisely what RT-2-style models need to generalize beyond the narrow distribution of training environments.
Our quality pipeline enforces the temporal alignment and action normalization standards that RT-2-style training demands. Every episode undergoes automated checks for frame-action synchronization, action range consistency, and language instruction diversity. We deliver in RLDS format -- the standard for VLA training -- with complete metadata including camera intrinsics, robot URDF references, and action normalization statistics that training pipelines can consume directly.
For teams pursuing the co-training paradigm RT-2 pioneered, Claru can also supply the scene-level visual data (without robot actions) needed for representation learning. Our egocentric video corpus covers manipulation scenarios with natural language descriptions, providing the diverse visual grounding data that complements robot demonstrations in the co-training mixture.
Key References
- [1] Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023 / arXiv:2307.15818, 2023.
- [2] Brohan et al. "RT-1: Robotics Transformer for Real-World Control at Scale." RSS 2023 / arXiv:2212.06817, 2022.
- [3] Driess et al. "PaLM-E: An Embodied Multimodal Language Model." ICML 2023 / arXiv:2303.03378, 2023.
- [4] Chen et al. "PaLI-X: On Scaling up a Multilingual Vision and Language Model." arXiv:2305.18565, 2023.
- [5] Kim et al. "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246, 2024.
- [6] O'Neill et al. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024 / arXiv:2310.08864, 2024.
Frequently Asked Questions
Is RT-2 open-source?
RT-2 is not open-source. Google DeepMind has published the paper and detailed methodology but not released model weights or training code. Open-source alternatives that follow RT-2's VLA architecture include OpenVLA (7B parameters, trained on Open X-Embodiment) and Octo (93M parameters, generalist policy). Claru provides data compatible with all RT-2-style VLA architectures, delivered in RLDS format with 256-bin action tokenization.
How is RT-2 different from RT-1?
RT-1 used a 35M-parameter custom FiLM-conditioned EfficientNet trained exclusively on 130K robot demonstrations. RT-2 replaces this with a 55B-parameter PaLI-X (or 12B PaLM-E) vision-language model co-trained on both web data and the same robot demonstrations. The key difference is emergent reasoning: RT-2 can handle novel objects and complex instructions not present in robot training data (62% vs 32% success on unseen objects) because it transfers knowledge from internet-scale pretraining.
How much training data does an RT-2-style model need?
The original RT-2 used 130,000 demonstrations from 13 robots over 17 months. For fine-tuning RT-2-style architectures on a new embodiment, community experience suggests 10,000-50,000 diverse demonstrations for strong zero-shot generalization, or 1,000-5,000 for single-task specialization. The co-training paradigm means the model generalizes from fewer robot demonstrations than RT-1 would need, but data quality -- precise action labeling, diverse language instructions, and varied environments -- matters more than raw volume.
How does RT-2's action tokenization work?
RT-2 discretizes each of 7 action dimensions (x/y/z position deltas, x/y/z rotation deltas, and gripper open/close) into 256 uniform bins. The resulting bin indices are formatted as space-separated integers (e.g., '1 128 91 241 5 101 127') plus a terminate/continue token. This representation allows robot actions to be treated identically to natural language tokens in the VLM vocabulary, which is the key innovation enabling web-knowledge transfer.
Can RT-2-style training work with my robot hardware?
Yes. The action tokenization scheme is hardware-agnostic -- you just need to define the 7-DoF action space boundaries for your robot and discretize into 256 bins. OpenVLA has demonstrated this across 22 different robot embodiments in the Open X-Embodiment dataset. Claru collects data on client-specified hardware and delivers it with the correct action normalization for RT-2-style training pipelines.
Get Data Formatted for RT-2-Style VLA Training
Tell us about your VLA project and we will deliver demonstrations with 256-bin action tokenization, language annotations, and RLDS formatting that your RT-2-style training pipeline requires.