Training Data by Model

Every robot learning model has specific data requirements. Browse data specifications, volume benchmarks, and format details for 25 leading models and datasets, and see how Claru delivers model-ready datasets.

OpenVLA
2024 | Stanford / TRI / UC Berkeley
224x224 RGB images (single third-person view) | ~5 Hz (autoregressive); ~130 Hz with OpenVLA-OFT parallel decoding

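In practice, a single model-ready training step for a single-view VLA like OpenVLA reduces to one image, one instruction, and one low-dimensional action. A minimal sketch of such a record; the field names and the 7-dim end-effector-delta action layout are illustrative assumptions, not OpenVLA's canonical schema.

```python
import numpy as np

# Illustrative single training step for a single-view VLA such as OpenVLA.
# Field names are hypothetical; the 7-dim action (end-effector deltas + gripper)
# follows a common Open X-Embodiment convention and may differ per dataset.
step = {
    "observation": {
        "image": np.zeros((224, 224, 3), dtype=np.uint8),  # third-person RGB, 224x224
    },
    "language_instruction": "pick up the blue mug",
    "action": np.zeros(7, dtype=np.float32),  # e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]
}

# At ~5 Hz autoregressive inference, consecutive steps are ~200 ms apart.
assert step["observation"]["image"].shape == (224, 224, 3)
```
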
RT-2
2023 | Google DeepMind
320x320 RGB images (single head-mounted camera) | 3 Hz

Octo
2024 | UC Berkeley
256x256 RGB (supports multi-view: primary + wrist) | ~10 Hz (variable per embodiment)

Pi-Zero (pi0)
2024 | Physical Intelligence
Multi-view RGB images (224x224 per view, 2-3 cameras) + natural language instruction | 50 Hz (action chunks decoded in a single forward pass)

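Because pi0 decodes a whole action chunk in one forward pass, chunk length and control rate together determine how often the model must be queried. A rough sketch of that arithmetic; the chunk length of 50 is an assumption consistent with the 50 Hz figure above, not a quoted specification.

```python
# Rough chunk-timing arithmetic for a chunked policy like pi0 (values illustrative).
control_rate_hz = 50   # actions executed at 50 Hz
chunk_length = 50      # actions decoded per forward pass (assumed)

chunk_duration_s = chunk_length / control_rate_hz   # 1.0 s of motion per inference
min_inference_rate_hz = 1.0 / chunk_duration_s      # model must be queried at >= 1 Hz

print(f"each chunk covers {chunk_duration_s:.2f} s; "
      f"inference needed at >= {min_inference_rate_hz:.1f} Hz to avoid stalls")
```
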
GR00T N1
2025 | NVIDIA
Multi-view RGB images at configurable resolution + proprioceptive state (joint positions, velocities) | System 2 at 10 Hz (reasoning); System 1 at higher rates (motor control)

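The System 2 / System 1 split implies two nested control loops: a slow vision-language loop that refreshes the plan, and a fast motor loop that acts on the most recent plan. A schematic sketch; the 100 Hz fast rate and all function names are placeholders for illustration, since the entry above only says "higher rates", and this is not NVIDIA's implementation.

```python
# Schematic dual-rate loop: slow "System 2" replanning, fast "System 1" motor control.
SLOW_HZ = 10    # reasoning / replanning rate
FAST_HZ = 100   # assumed motor-control rate, for illustration only

def slow_system2_replan(t):
    return {"latent_plan": t}                       # placeholder plan

def fast_system1_act(plan, t):
    return {"joint_targets": plan["latent_plan"]}   # placeholder action

def execute_motor_command(cmd):
    pass                                            # would send to robot / simulator

def run(horizon_s=1.0):
    plan = None
    for t in range(int(horizon_s * FAST_HZ)):
        if t % (FAST_HZ // SLOW_HZ) == 0:           # every 100 ms: refresh the plan
            plan = slow_system2_replan(t)
        execute_motor_command(fast_system1_act(plan, t))  # every 10 ms: act

run()
```
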
Diffusion Policy
2023 | Columbia / MIT / TRI
1-3 camera views at 96x96 (CNN) or 224x224 (ViT) RGB + optional proprioceptive state | 10-50 Hz (action chunks at ~25 Hz prediction frequency with 8-step execution overlap)

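The figures above reflect receding-horizon execution: the policy predicts a longer action chunk but executes only a prefix before re-predicting, so observations stay fresh. A toy sketch of that loop; the 16-step prediction and 8-step execution horizons are assumed values taken from common Diffusion Policy configurations, not fixed requirements, and the denoising step is stubbed out.

```python
import numpy as np

PRED_HORIZON = 16   # actions predicted per inference (assumed)
EXEC_HORIZON = 8    # actions executed before re-predicting (assumed)
ACTION_DIM = 7

def predict_chunk(obs):
    # Stand-in for the diffusion denoising loop; returns (PRED_HORIZON, ACTION_DIM).
    return np.zeros((PRED_HORIZON, ACTION_DIM), dtype=np.float32)

def rollout(num_env_steps=40):
    executed, obs, step = [], None, 0
    while step < num_env_steps:
        chunk = predict_chunk(obs)
        for a in chunk[:EXEC_HORIZON]:   # execute only the first 8 actions
            executed.append(a)
            step += 1
            if step >= num_env_steps:
                break
        # obs would be refreshed from the environment here before re-predicting
    return np.stack(executed)

print(rollout().shape)   # (40, 7)
```
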
ACT (Action Chunking with Transformers) / ALOHA
2023 | Stanford / Google DeepMind
4x 480x640 RGB images (2 wrist-mounted + 2 third-person) + 14-dim joint positions | 50 Hz

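In tensor terms, one ALOHA timestep is four 480x640 RGB frames plus a 14-dim joint vector, and ACT supervises a chunk of future joint targets at 50 Hz. The shapes below follow that description; the chunk length of 100 (2 s at 50 Hz) is an assumption from common ACT configurations, and the field names are illustrative.

```python
import numpy as np

CHUNK = 100  # assumed action-chunk length (2 s at 50 Hz); configurable in practice

timestep = {
    "images": np.zeros((4, 480, 640, 3), dtype=np.uint8),    # 2 wrist + 2 third-person cameras
    "qpos": np.zeros(14, dtype=np.float32),                   # 14-dim bimanual joint positions
    "action_chunk": np.zeros((CHUNK, 14), dtype=np.float32),  # future joint targets at 50 Hz
}

# One second of teleoperated demonstration contains 50 such timesteps.
bytes_per_step = sum(a.nbytes for a in
                     (timestep["images"], timestep["qpos"], timestep["action_chunk"]))
print(f"~{bytes_per_step / 1e6:.1f} MB of raw arrays per timestep before compression")
```
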
RT-1
2022 | Google (Everyday Robots)
300x300 RGB images (single head-mounted camera), 6-frame history (current + 5 previous) | 3 Hz

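The 6-frame history above is simply a sliding window over the camera stream. A minimal sketch of that buffering; padding by repeating the first frame at episode start is a common convention assumed here, not taken from the RT-1 code.

```python
from collections import deque
import numpy as np

HISTORY = 6  # current frame + 5 previous, per the RT-1 entry above

class FrameHistory:
    """Sliding window over the last HISTORY frames at the 3 Hz control rate."""
    def __init__(self):
        self.buf = deque(maxlen=HISTORY)

    def push(self, frame):
        if not self.buf:                    # pad by repetition at episode start (assumed)
            self.buf.extend([frame] * HISTORY)
        else:
            self.buf.append(frame)

    def stack(self):
        return np.stack(self.buf)           # (6, 300, 300, 3)

hist = FrameHistory()
hist.push(np.zeros((300, 300, 3), dtype=np.uint8))
print(hist.stack().shape)                   # (6, 300, 300, 3)
```
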
BridgeData V2
2023 | UC Berkeley
256x256 RGB from third-person camera (optional wrist camera); 8-dim proprioceptive state | 5 Hz

Gato
2022 | Google DeepMind
RGB images tokenized into 16x16 patches via ResNet encoder + continuous proprioceptive state mu-law encoded to 1024 bins | Variable per environment: ~5-20 Hz for robotics tasks, up to 60 Hz for Atari games

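The mu-law step compresses unbounded continuous values into a fixed range before uniform binning, so small magnitudes keep resolution. A sketch of that encoding; the constants mu=100 and M=256 follow the values reported in the Gato paper, the 1024 bins match the entry above, and the helper names are ours.

```python
import numpy as np

MU, M, BINS = 100.0, 256.0, 1024   # mu-law constants as reported for Gato

def mu_law_encode(x):
    """Compress continuous values into [-1, 1] with mu-law companding."""
    x = np.asarray(x, dtype=np.float64)
    y = np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)
    return np.clip(y, -1.0, 1.0)

def discretize(y):
    """Uniformly bin the companded value into BINS integer tokens (0..1023)."""
    return np.clip(((y + 1.0) / 2.0 * BINS).astype(np.int64), 0, BINS - 1)

joint_velocities = np.array([-3.2, -0.05, 0.0, 0.8, 12.5])
tokens = discretize(mu_law_encode(joint_velocities))
print(tokens)   # small magnitudes stay near the middle bins, large ones move outward
```
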
PaLM-E
2023 | Google DeepMind / TU Berlin
Multi-modal: 224x224 or 512x512 RGB images (ViT-22B encoder) + optional scene state vectors and object-centric representations | Plan-level (~1-2 Hz for plan generation); downstream motor controllers operate at 3-10 Hz

RoboCat
2023 | Google DeepMind
Multi-view RGB images (overhead + wrist cameras) tokenized via ViT, interleaved with proprioception tokens | 5-10 Hz depending on embodiment (Sawyer at 5 Hz, KUKA at 10 Hz)

RoboFlamingo
2023 | ByteDance Research
RGB images (200x200 CALVIN / 224x224+ real-world) from static and gripper cameras; 6-12 frame observation history per prediction | 5 Hz (one action prediction per forward pass)

Voltron
2023 | Stanford
224x224 RGB video frames processed as 16x16 patches by ViT (Small or Base scale) | N/A (visual representation model); downstream policies typically operate at 5-20 Hz

R3M
2022 | Meta AI (Nair et al.)
224x224 RGB frames from egocentric video (pretraining) or robot cameras (downstream) | N/A (visual representation model, not a policy; downstream policies define their own control rate)

MVP (Masked Visual Pre-training)
2022 | UC Berkeley
224x224 RGB images (single or multi-view) | Determined by downstream policy (typically 10-50 Hz)

VC-1
2023 | Meta AI
224x224 RGB images normalized to ImageNet statistics, processed as 16x16 patches by ViT-L | N/A (visual representation model); downstream policies typically operate at 5-20 Hz

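The preprocessing above is the standard ViT recipe: resize to 224x224, scale to [0, 1], normalize with ImageNet channel statistics, and split into 16x16 patches. A minimal numpy version; the nearest-neighbor resize is a crude stand-in for the usual bilinear resize, and the function name is ours.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_uint8):
    """Resize to 224x224 (nearest-neighbor stand-in), normalize, report patch count."""
    h, w, _ = image_uint8.shape
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    resized = image_uint8[rows][:, cols].astype(np.float32) / 255.0
    normalized = (resized - IMAGENET_MEAN) / IMAGENET_STD
    num_patches = (224 // 16) ** 2          # 14 x 14 = 196 tokens for a ViT-L/16
    return normalized, num_patches

img = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
x, n = preprocess(img)
print(x.shape, n)   # (224, 224, 3) 196
```
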
Theia
2024 | The AI Institute (Boston Dynamics AI Institute)
224x224 RGB images (standard ViT patch-16 input) | N/A (representation model); downstream policies typically operate at 5-50 Hz

SuSIE
2024 | UC Berkeley
256x256 RGB images (third-person camera) | ~2 Hz subgoal generation, 5-10 Hz low-level motor commands

GENIMA
2024 | Dyson Robot Learning Lab
Single or multi-view RGB images (128x128 sim, 480x640 real) with ControlNet conditioning | 10 Hz

GR-2
2024 | ByteDance Research
Multi-view RGB video frames tokenized via VQGAN into discrete tokens | 10 Hz

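VQGAN-style tokenization maps each spatial feature vector of an encoded frame to the index of its nearest codebook entry, turning video into a sequence of discrete tokens. A bare-bones sketch of that lookup; the latent grid size and codebook size are illustrative and not GR-2's actual configuration.

```python
import numpy as np

def vq_tokenize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry (L2 distance)."""
    # features: (H, W, D) encoder output for one frame; codebook: (K, D)
    flat = features.reshape(-1, features.shape[-1])                      # (H*W, D)
    d2 = ((flat ** 2).sum(1, keepdims=True)                              # squared distances,
          - 2.0 * flat @ codebook.T                                      # computed without the
          + (codebook ** 2).sum(1))                                      # (H*W, K, D) intermediate
    return d2.argmin(axis=1).reshape(features.shape[:2])                 # (H, W) integer tokens

rng = np.random.default_rng(0)
frame_features = rng.normal(size=(16, 16, 64))   # illustrative 16x16 latent grid
codebook = rng.normal(size=(1024, 64))           # illustrative codebook of 1024 entries
tokens = vq_tokenize(frame_features, codebook)
print(tokens.shape, tokens.dtype)                # (16, 16) int64
```
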
HumanPlus
2024 | Stanford
Shadowing: single RGB camera + 3D pose estimation; Autonomous: two head-mounted egocentric RGB cameras (480x640) | 30 Hz (both shadowing and autonomous execution)

CrossFormer
2024 | UC Berkeley / CMU
Variable: 1-4 camera views at 224x224 RGB + variable-dim proprioceptive state, per embodiment | Variable per embodiment (2-50 Hz); action scaling handled by per-embodiment detokenizer

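Training across embodiments with control rates from 2 to 50 Hz often means resampling trajectories recorded at one rate to another before batching. A simple sketch of timestamp-based nearest-neighbor resampling; this is a deliberately naive helper of our own, not CrossFormer's data pipeline, and interpolation or action aggregation may be preferable depending on the action space.

```python
import numpy as np

def resample_trajectory(actions, src_hz, tgt_hz):
    """Resample a (T, action_dim) trajectory from src_hz to tgt_hz by nearest timestamp."""
    duration = len(actions) / src_hz                            # total trajectory length in seconds
    t_tgt = np.arange(int(round(duration * tgt_hz))) / tgt_hz   # target timestamps
    idx = np.clip(np.round(t_tgt * src_hz).astype(int), 0, len(actions) - 1)
    return actions[idx]

traj_5hz = np.random.randn(50, 7)        # 10 s of actions recorded at 5 Hz
traj_50hz = resample_trajectory(traj_5hz, src_hz=5, tgt_hz=50)
print(traj_50hz.shape)                   # (500, 7)
```
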
HPT (Heterogeneous Pre-trained Transformers)
2024 | MIT CSAIL / Meta FAIR
Heterogeneous: any combination of RGB, depth, point clouds, and proprioception, tokenized to 32 fixed tokens via embodiment-specific stems | Variable per embodiment (matches source dataset control rate)

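The fixed-token interface is what lets heterogeneous inputs share one transformer trunk: an embodiment-specific stem cross-attends from a small set of learned latent tokens into however many modality tokens that embodiment produces. A PyTorch-style sketch of that idea; the dimensions, names, and single attention layer are illustrative simplifications, not HPT's actual architecture.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Map a variable number of modality tokens to a fixed 32-token summary."""
    def __init__(self, dim=256, num_latents=32, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, N, dim), N varies per embodiment
        q = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)       # cross-attention: latents attend to inputs
        return out                                  # (B, 32, dim), same shape for every embodiment

stem = Stem()
rgb_tokens = torch.randn(2, 196, 256)       # e.g. 14x14 ViT patches from one camera
proprio_tokens = torch.randn(2, 14, 256)    # e.g. per-joint proprioception tokens
fused = stem(torch.cat([rgb_tokens, proprio_tokens], dim=1))
print(fused.shape)                          # torch.Size([2, 32, 256])
```
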
Pi-0.5
2025 | Physical Intelligence
Multi-view RGB (up to 3 cameras: primary third-person + left/right wrist) + proprioceptive state (joint positions, velocities, gripper state) | 50 Hz (50-step action chunks predicted per inference pass)

Need Data for a Different Model?

We deliver datasets formatted for any robot learning architecture. Tell us your model and we will match its exact data specification.

Get Data for Your Model