Training Data by Model
Every robot learning model has specific data requirements. Browse data specifications, volume benchmarks, and format details for 25 leading models, and see how Claru delivers model-ready datasets.
OpenVLA
2024 · Stanford / TRI / UC Berkeley
224x224 RGB images (single third-person view) | ~5 Hz (autoregressive); ~130 Hz with OpenVLA-OFT parallel decoding
RT-2
2023 · Google DeepMind
320x320 RGB images (single head-mounted camera) | 3 Hz
Octo
2024 · UC Berkeley
256x256 RGB (supports multi-view: primary + wrist) | ~10 Hz (variable per embodiment)
Pi-Zero (pi0)
2024 · Physical Intelligence
Multi-view RGB images (224x224 per view, 2-3 cameras) + natural language instruction | 50 Hz (action chunks decoded in a single forward pass)
GR00T N1
2025 · NVIDIA
Multi-view RGB images at configurable resolution + proprioceptive state (joint positions, velocities) | System 2 at 10 Hz (reasoning); System 1 at higher rates (motor control)
Diffusion Policy
2023 · Columbia / MIT / TRI
1-3 camera views at 96x96 (CNN) or 224x224 (ViT) RGB + optional proprioceptive state | 10-50 Hz (action chunks at ~25 Hz prediction frequency with 8-step execution overlap)
ACT (Action Chunking with Transformers) / ALOHA
2023 · Stanford / Google DeepMind
4x 480x640 RGB images (2 wrist-mounted + 2 third-person) + 14-dim joint positions | 50 Hz
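As a concrete picture of the shapes above, a single ACT-style training sample could be laid out as follows. The field names are illustrative, not ACT's actual dataset schema; the 100-step chunk (2 s at 50 Hz) is the chunk length commonly cited for ACT, treated here as an assumption.

```python
import numpy as np

# Illustrative sample layout matching the ACT/ALOHA spec above:
# four 480x640 RGB views, 14-dim joint state, 50 Hz action chunks.
sample = {
    "images": np.zeros((4, 480, 640, 3), dtype=np.uint8),  # 2 wrist + 2 third-person
    "qpos": np.zeros(14, dtype=np.float32),                # 14-dim joint positions
    "actions": np.zeros((100, 14), dtype=np.float32),      # 100-step chunk (2 s at 50 Hz)
}
```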
RT-1
2022 · Google (Everyday Robots)
300x300 RGB images (single head-mounted camera), 6-frame history (current + 5 previous) | 3 Hz
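A 6-frame history (current frame + 5 previous) is easy to maintain with a small ring buffer. This is a hypothetical helper, not RT-1 code, and the repeat-padding warm-up is an assumption about how a short history might be handled before the buffer fills.

```python
from collections import deque

class FrameHistory:
    """Keep the current frame plus the previous k frames (k=5 gives a 6-frame window)."""

    def __init__(self, k=5):
        self.buf = deque(maxlen=k + 1)

    def push(self, frame):
        self.buf.append(frame)  # oldest frame drops out automatically

    def window(self):
        # Pad by repeating the oldest frame until the window is full
        # (an assumed warm-up convention, not RT-1's spec).
        frames = list(self.buf)
        while len(frames) < self.buf.maxlen:
            frames.insert(0, frames[0])
        return frames
```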
BridgeData V2
2023 · UC Berkeley
256x256 RGB from third-person camera (optional wrist camera); 8-dim proprioceptive state | 5 Hz
Gato
2022 · Google DeepMind
RGB images tokenized into 16x16 patches via ResNet encoder + continuous proprioceptive state mu-law encoded to 1024 bins | Variable per environment: ~5-20 Hz for robotics tasks, up to 60 Hz for Atari games
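The mu-law step above can be sketched as follows. The bin count (1024) comes from the spec above; the constants mu=100 and M=256 follow the values reported for Gato, but the helper itself is illustrative.

```python
import numpy as np

def mu_law_discretize(x, mu=100.0, M=256.0, bins=1024):
    """Mu-law compand a continuous value, then bucket it into `bins` uniform bins."""
    # Compand: compress large magnitudes while keeping resolution near zero
    companded = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(M * mu + 1.0)
    companded = np.clip(companded, -1.0, 1.0)
    # Map [-1, 1] uniformly onto integer bin indices [0, bins-1]
    idx = np.floor((companded + 1.0) / 2.0 * bins).astype(int)
    return np.clip(idx, 0, bins - 1)
```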
PaLM-E
2023 · Google Research / TU Berlin
Multi-modal: 224x224 or 512x512 RGB images (ViT-22B encoder) + optional scene state vectors and object-centric representations | Plan-level (~1-2 Hz for plan generation); downstream motor controllers operate at 3-10 Hz
RoboCat
2023 · Google DeepMind
Multi-view RGB images (overhead + wrist cameras) tokenized via ViT, interleaved with proprioception tokens | 5-10 Hz depending on embodiment (Sawyer at 5 Hz, KUKA at 10 Hz)
RoboFlamingo
2023 · ByteDance Research
RGB images (200x200 CALVIN / 224x224+ real-world) from static and gripper cameras; 6-12 frame observation history per prediction | 5 Hz (one action prediction per forward pass)
Voltron
2023 · Stanford
224x224 RGB video frames processed as 16x16 patches by ViT (Small or Base scale) | N/A (visual representation model); downstream policies typically operate at 5-20 Hz
R3M
2022 · Meta AI (Nair et al.)
224x224 RGB frames from egocentric video (pretraining) or robot cameras (downstream) | N/A (visual representation model, not a policy — downstream policies define their own control rate)
MVP (Masked Visual Pre-training)
2022 · UC Berkeley
224x224 RGB images (single or multi-view) | Determined by downstream policy (typically 10-50 Hz)
VC-1
2023 · Meta AI
224x224 RGB images normalized to ImageNet statistics, processed as 16x16 patches by ViT-L | N/A (visual representation model); downstream policies typically operate at 5-20 Hz
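ImageNet-statistics normalization, as mentioned above, is a standard transform. A minimal sketch, assuming resizing to 224x224 is handled upstream (the function name is illustrative; the mean/std values are the standard ImageNet statistics):

```python
import numpy as np

# Standard ImageNet channel statistics (RGB order)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_uint8):
    """Scale a 224x224x3 uint8 image to [0, 1], then normalize per channel."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```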
Theia
2024 · The AI Institute (Boston Dynamics AI Institute)
224x224 RGB images (standard ViT patch-16 input) | N/A (representation model); downstream policies typically operate at 5-50 Hz
SuSIE
2024 · UC Berkeley
256x256 RGB images (third-person camera) | ~2 Hz subgoal generation, 5-10 Hz low-level motor commands
GENIMA
2024 · Dyson Robot Learning Lab
Single or multi-view RGB images (128x128 sim, 480x640 real) with ControlNet conditioning | 10 Hz
GR-2
2024 · ByteDance Research
Multi-view RGB video frames tokenized via VQGAN into discrete tokens | 10 Hz
HumanPlus
2024 · Stanford
Shadowing: single RGB camera + 3D pose estimation; Autonomous: two head-mounted egocentric RGB cameras (480x640) | 30 Hz (both shadowing and autonomous execution)
CrossFormer
2024 · UC Berkeley / CMU
Variable: 1-4 camera views at 224x224 RGB + variable-dim proprioceptive state, per embodiment | Variable per embodiment (2-50 Hz); action scaling handled by per-embodiment detokenizer
HPT
2024 · MIT CSAIL / Meta FAIR
Heterogeneous: any combination of RGB, depth, point clouds, and proprioception, tokenized to 32 fixed tokens via embodiment-specific stems | Variable per embodiment (matches source dataset control rate)
Pi-0.5
2025 · Physical Intelligence
Multi-view RGB (up to 3 cameras: primary third-person + left/right wrist) + proprioceptive state (joint positions, velocities, gripper state) | 50 Hz (50-step action chunks predicted per inference pass)
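A rough sketch of what one policy input under the spec above could look like. Every field name, the per-view resolution, and the state dimension are assumptions for illustration; the arithmetic simply restates that a 50-step chunk executed at 50 Hz covers one second of control per inference pass.

```python
import numpy as np

# Hypothetical observation layout: up to three RGB views plus
# proprioceptive state (resolution and state dim are assumed).
obs = {
    "base_rgb": np.zeros((224, 224, 3), dtype=np.uint8),        # primary third-person
    "left_wrist_rgb": np.zeros((224, 224, 3), dtype=np.uint8),
    "right_wrist_rgb": np.zeros((224, 224, 3), dtype=np.uint8),
    "state": np.zeros(32, dtype=np.float32),                    # joints + gripper (dim assumed)
}

# One inference pass yields a 50-step chunk executed at 50 Hz,
# i.e. 1.0 s of control per pass (from the spec above).
chunk_len, control_hz = 50, 50
seconds_per_chunk = chunk_len / control_hz
```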
Need Data for a Different Model?
We deliver datasets formatted for any robot learning architecture. Tell us your model and we will match the exact data specification.
Get Data for Your Model