Training Data for Gato

Gato is DeepMind's generalist agent — a single 1.2B-parameter transformer trained on 604 distinct tasks spanning Atari games, image captioning, text dialogue, and real-world robot manipulation. By tokenizing all modalities into a unified sequence format, Gato demonstrated that a single set of neural network weights can perform both digital and physical tasks. This page covers Gato's robotics data specification, tokenization scheme, and how Claru provides compatible demonstration data.

Organization: Google DeepMind
Year: 2022

Input/Output Specification

Observation

RGB images tokenized into 16x16 patches via a ResNet encoder; continuous proprioceptive state mu-law encoded and discretized to 1024 bins

Action

Continuous joint velocity commands, mu-law encoded and discretized to 1024 integer tokens per dimension

Language

Text-based task conditioning via same token vocabulary (text tokens interleaved with observation/action tokens)

Control Frequency

Variable per environment: ~5-20 Hz for robotics tasks, up to 60 Hz for Atari games

How Claru Data Integrates with Gato

Claru provides robot demonstration data pre-tokenized for Gato-style multi-modal transformer architectures. Our pipeline handles mu-law encoding (configurable mu parameter, default 100), uniform discretization to target vocabulary size (1024 bins by default), image patch tokenization, and sequence assembly with correct separator tokens and modality ordering. For teams building new architectures, we also deliver raw continuous data alongside tokenized versions, enabling experimentation with different tokenization schemes. Our collection covers block stacking, pick-and-place, and tabletop rearrangement tasks across multiple robot platforms (WidowX 250, Franka Emika, UR5e), with per-episode task metadata suitable for multi-task conditioning.

What Is Gato?

Gato is a generalist agent published by Scott Reed, Konrad Zolna, Emilio Parisotto, and colleagues at DeepMind in May 2022. The paper's central thesis is provocative: rather than building specialist models for each task, a single large transformer can learn to play Atari games, caption images, chat, stack blocks with a real robot arm, and navigate in simulated environments — all with the same weights. Gato achieved this by converting every modality (text, images, continuous actions, discrete actions, proprioceptive states) into a common token sequence and training with a standard autoregressive language modeling objective.

The robotics component of Gato is what makes it relevant to the physical AI community. Gato was trained on real-world robot manipulation data from a Sawyer robot arm performing block stacking tasks, as well as simulated robot data from DM Control Suite locomotion tasks and RGB Stacking benchmarks. On the real Sawyer stacking task, Gato achieved 87% success averaged across 1-5 block configurations — competitive with specialist policies trained exclusively on stacking data.

Gato was a proof-of-concept rather than a production-ready robotics model. With 1.2 billion parameters, it is smaller than most modern VLAs (OpenVLA has 7B, RT-2 has 55B). Its robotics performance, while impressive for a generalist, falls short of specialist models on difficult manipulation tasks. The real significance of Gato is architectural: it demonstrated that the unified tokenization paradigm works for physical control, directly inspiring the VLA (Vision-Language-Action) model family that followed — including RT-2, Octo, and OpenVLA.

Understanding Gato's data requirements is relevant for two audiences: teams building on Gato-style multi-task architectures (unified tokenization of diverse modalities), and teams interested in how multi-task pretraining on non-robotics data (games, text, images) can improve robotics policy performance. The data pipeline for Gato-style models is fundamentally different from specialist models like ACT or Diffusion Policy — it requires tokenization of continuous values and interleaving of heterogeneous episode formats.

Gato at a Glance

Model parameters: 1.2B
Distinct tasks in training data: 604
Vocabulary size (tokenized actions/observations): 1024
Success on real Sawyer block stacking: 87%
Sequence length (tokens): 8K
Robotics episodes (real Sawyer stacking): ~600

Input / Output Specification (Robotics Component)

Image observations: RGB images tokenized into 16x16 patches via ResNet encoder, then discretized to 1024-vocab tokens
Proprioceptive state: Continuous joint positions/velocities, mu-law encoded and discretized to 1024 bins
Action space: Continuous joint velocity commands, mu-law encoded and discretized to 1024 bins per dimension
Tokenization scheme: All modalities converted to integer tokens in [0, 1023]; separator tokens delineate modality boundaries
Sequence format: Interleaved: [image tokens | proprioception tokens | action tokens | separator] per timestep
Control frequency: Variable per environment; ~5-20 Hz for robotics tasks
Episode context length: Up to 8,192 tokens (~20-50 timesteps depending on observation size)
Robot platforms: Rethink Sawyer (real block stacking), DM Control Suite (simulated locomotion), RGB Stacking (simulated)

Architecture and Key Innovations

Gato's architecture is a decoder-only transformer with 24 layers, 16 attention heads, and a hidden dimension of 2048 — the same fundamental architecture as GPT-2/GPT-3. The key innovation is not in the transformer itself but in the tokenization pipeline that converts heterogeneous data into a unified token vocabulary.

For images, Gato uses a ResNet block to encode each image into a sequence of embedding vectors (one per 16x16 patch), which are then projected into the transformer's embedding space. Unlike language tokens, image tokens retain spatial structure through position embeddings. For the robotics tasks, observations typically include one RGB image from a workspace camera.

For continuous values (joint positions, joint velocities, end-effector positions, force readings), Gato applies mu-law encoding to compress the dynamic range, then uniformly discretizes the result into 1024 bins. Each continuous dimension becomes a single integer token in [0, 1023]. A 7-DoF joint position reading thus becomes 7 tokens. This discretization introduces quantization error — with 1024 bins over a typical joint range of [-pi, pi], the resolution is approximately 0.006 radians (0.35 degrees), which is sufficient for most manipulation tasks but may limit precision on high-accuracy assembly.
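The quoted resolution figure is easy to verify as a back-of-envelope calculation. This sketch computes the uniform-grid bound for 1024 bins over [-pi, pi]; note that mu-law warping makes the actual bin widths non-uniform (finer near zero), so this is the average, not the per-bin, resolution.

```python
import math

# 1024 uniform bins over a joint range of [-pi, pi].
bins = 1024
resolution_rad = 2 * math.pi / bins
resolution_deg = math.degrees(resolution_rad)
print(f"{resolution_rad:.4f} rad, {resolution_deg:.2f} deg")  # 0.0061 rad, 0.35 deg
```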

The training objective is standard autoregressive next-token prediction. At each position in the sequence, the model predicts the next token given all preceding tokens. For action tokens, this means the model predicts the discretized action given the observation and all preceding actions in the episode. For image tokens, the model predicts the next image patch token. The same cross-entropy loss applies to all token types, with masking applied to image tokens during robotics training (the model is not required to predict images, only actions).
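The masked objective can be sketched in a few lines. This is an illustrative pure-Python version (not DeepMind's implementation): one cross-entropy for every token type, with image-token positions excluded via a 0/1 mask, as described above.

```python
import math

def masked_nll(logits_seq, targets, mask):
    """Mean negative log-likelihood over positions where mask == 1.

    logits_seq: list of per-position logit vectors over the vocabulary
    targets:    list of target token ids (the next token at each position)
    mask:       list of 0/1 flags; image-token positions get 0 so they
                contribute no loss during robotics training
    """
    total, count = 0.0, 0
    for logits, tgt, m in zip(logits_seq, targets, mask):
        if not m:
            continue
        z = max(logits)  # log-sum-exp with max subtraction for stability
        log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
        total += log_norm - logits[tgt]
        count += 1
    return total / max(count, 1)
```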

Multi-task learning is handled through prompting: each episode begins with a task-identifying prefix sequence. For robotics tasks, this prefix implicitly identifies the robot embodiment and task through the structure of the observation tokens. The model learns to associate different token patterns with different behavioral policies. At inference time, the model conditions on the task prefix and the current observation tokens, then autoregressively generates action tokens.
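The inference loop described above can be sketched as greedy autoregressive decoding. Everything here is illustrative: `model` stands in for any callable mapping a token sequence to next-token logits, and the token ids are hypothetical, not Gato's actual vocabulary.

```python
def generate_action(model, prefix_tokens, obs_tokens, action_dim=7):
    # Condition on the task prefix plus the current observation tokens,
    # then decode one discretized token per action dimension.
    seq = list(prefix_tokens) + list(obs_tokens)
    action_tokens = []
    for _ in range(action_dim):
        logits = model(seq)  # next-token logits over the vocabulary
        tok = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        action_tokens.append(tok)
        seq.append(tok)  # feed the chosen token back in autoregressively
    return action_tokens
```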

Comparison with Related Generalist Models

| Attribute | Gato | RT-2 | Octo | OpenVLA |
| --- | --- | --- | --- | --- |
| Parameters | 1.2B | 55B (PaLI-X) | 93M | 7B (Prismatic) |
| Non-robotics pretraining | Yes (Atari, text, images) | Yes (web-scale VLM) | No (robot-only) | Yes (VLM backbone) |
| Action representation | Discretized tokens (1024 bins) | Discretized tokens (256 bins) | Continuous (diffusion) | Discretized tokens (256 bins) |
| Language conditioning | Text tokens in same vocabulary | Natural language via VLM | T5 encoder (optional) | Natural language via VLM |
| Real robot tasks trained | Block stacking (Sawyer) | 700+ skills (Everyday Robots) | BridgeData V2 + OXE | 970K episodes (OXE) |
| Open-source | No | No | Yes | Yes |
| Year | 2022 | 2023 | 2024 | 2024 |

Training Data Requirements

Gato's training data spans a remarkable breadth of domains. The full training set includes Atari gameplay (~200M frames), the MassiveText corpus for language tasks, image-caption pairs from the ALIGN and JFT datasets, DM Control Suite simulated locomotion (23 tasks), DM Lab navigation tasks, simulated RGB Stacking tasks, and real-world Sawyer block stacking demonstrations, for 604 distinct tasks in total. The robotics component is a small fraction of the total data volume: approximately 600 real Sawyer episodes and tens of thousands of simulated episodes.

For teams building Gato-style architectures, the robotics data must be tokenized following the specific scheme: continuous values are mu-law encoded (mu=100), then uniformly quantized to 1024 integer bins. Each observation-action timestep is serialized as a token sequence: [image patch tokens (typically 256-1024 tokens depending on resolution) | proprioception tokens (7-14 tokens for joint positions/velocities) | action tokens (7 tokens for a 7-DoF arm) | separator token]. Episodes are packed into sequences of up to 8,192 tokens.
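The serialization described above can be sketched as follows. The separator id used here (1024, one past the value vocabulary) and the function names are assumptions for illustration, not Gato's actual conventions.

```python
def serialize_timestep(image_tokens, proprio_tokens, action_tokens, sep_token=1024):
    # One timestep in the modality order described above:
    # image patches, then proprioception, then actions, then a separator.
    return list(image_tokens) + list(proprio_tokens) + list(action_tokens) + [sep_token]

def pack_episode(timesteps, max_len=8192):
    # Concatenate serialized timesteps into one training sequence,
    # stopping before the context limit would be exceeded.
    seq = []
    for ts in timesteps:
        if len(seq) + len(ts) > max_len:
            break
        seq.extend(ts)
    return seq
```

With 16 image-patch tokens, 14 proprioception tokens, and 7 action tokens, each timestep is 38 tokens, so an 8,192-token context holds 215 full timesteps.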

The mu-law encoding is critical and often overlooked. Standard uniform discretization of continuous values performs poorly because the distribution of joint values and velocities is not uniform — values cluster near rest positions. Mu-law encoding (with mu=100) compresses the dynamic range so that small values near zero get more bins and large values get fewer bins, matching the empirical distribution. The formula is: encoded = sign(x) * ln(1 + mu * |x|) / ln(1 + mu), then uniformly discretized to [0, 1023].
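The formula above translates directly to code. This is a minimal sketch assuming values have been pre-scaled to [-1, 1]; the approximate inverse (`from_token`) is added here for round-trip inspection and is not part of the paper's spec.

```python
import math

def mu_law_encode(x, mu=100.0):
    # Companding formula quoted above: sign(x) * ln(1 + mu*|x|) / ln(1 + mu).
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def to_token(x, mu=100.0, bins=1024):
    # Encode, then uniformly discretize [-1, 1] into integer bins [0, bins-1].
    y = mu_law_encode(x, mu)
    return min(int((y + 1.0) / 2.0 * bins), bins - 1)

def from_token(token, mu=100.0, bins=1024):
    # Approximate inverse: bin center expanded back through the inverse mu-law.
    y = (token + 0.5) / bins * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Because the companding is logarithmic, bins are densest near zero, where joint values and velocities cluster in practice.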

For real-world robot demonstrations specifically, the Gato paper used data from a Sawyer robot performing block stacking with 1-5 blocks. Each episode contains approximately 100-300 timesteps of RGB images at 64x64 resolution, 7-dim joint positions, 7-dim joint velocities, and 7-dim joint velocity commands as actions. The total real-world robotics dataset is modest — approximately 600 episodes — but the massive non-robotics pretraining data provides a strong prior for visual understanding and sequential decision-making.

Data quality for Gato-style training differs from specialist models. Because the model trains on 604 tasks simultaneously, the gradient signal from any single robotics task is diluted. This means the model is less sensitive to individual bad demonstrations but requires that the overall task distribution be well-represented. For robotics fine-tuning or evaluation, the tokenization must exactly match the pretraining scheme — different mu-law parameters or bin counts will produce meaningless tokens that the model cannot interpret.

How Claru Data Integrates with Gato-Style Architectures

Claru provides robot demonstration data pre-tokenized for Gato-style multi-modal transformer architectures. Our pipeline handles the full tokenization chain: mu-law encoding of continuous values with configurable mu parameter (default 100), uniform discretization to the target vocabulary size (1024 bins by default), image patch tokenization via a provided ResNet encoder, and sequence assembly with correct separator tokens and modality ordering.

For teams fine-tuning or evaluating on Gato-style models, we deliver data in the exact serialized token format the model expects — packed sequences of up to 8,192 tokens with correct episode boundaries. Each delivery includes the tokenization configuration (mu parameter, bin count, image resolution, patch size) and a detokenization script for inspecting the data in human-readable form.

For teams building new Gato-inspired architectures (which is the more common use case in 2024-2026, given that Gato itself is not open-source), we deliver raw demonstration data alongside the tokenized version. The raw data includes RGB images at configurable resolution, continuous joint positions/velocities, continuous actions, and episode metadata. This allows teams to experiment with different tokenization schemes — varying the vocabulary size, mu parameter, or switching to learned tokenization (as in recent VQ-VAE approaches) — without re-collecting demonstrations.

Our collection covers the manipulation tasks most commonly used in Gato-style benchmarks: block stacking (1-5 blocks), pick-and-place with diverse objects, and tabletop rearrangement. We collect on multiple robot platforms (WidowX 250, Franka Emika, UR5e) to support cross-embodiment training. Each delivery includes task-identifying metadata that can serve as the episode prefix tokens for multi-task training.

Key References

  1. Reed et al. "A Generalist Agent." TMLR, 2022.
  2. Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023.
  3. Kim et al. "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL, 2024.
  4. Octo Model Team. "Octo: An Open-Source Generalist Robot Policy." RSS, 2024.
  5. Lee et al. "Multi-Game Decision Transformers." NeurIPS, 2022.

Frequently Asked Questions

Is Gato open-source?

No, Gato is not open-source. DeepMind published the architecture and training details but did not release the model weights, training code, or full dataset. However, the architecture is a standard decoder-only transformer, and the tokenization scheme is fully described in the paper. Several open-source reimplementations exist. For practical deployment, most teams now use Octo or OpenVLA, which are open-source and offer better robotics performance than Gato.

Is 1024-bin action discretization precise enough for manipulation?

Gato discretizes continuous actions into 1024 bins per dimension using mu-law encoding. For a typical joint range of [-pi, pi], this gives approximately 0.006 radian (0.35 degree) resolution. For most manipulation tasks (pick-and-place, stacking, pushing), this resolution is sufficient. For precision assembly tasks requiring sub-millimeter accuracy (peg insertion, connector mating), the quantization error may be limiting. RT-2 and OpenVLA use only 256 bins, which is coarser but still works for most tasks.

Does non-robotics pretraining actually improve robotics performance?

The evidence is mixed. Gato showed that a single model can do both, but did not ablate whether Atari/text pretraining improves robotics performance versus robotics-only training. RT-2 showed clear evidence that VLM pretraining on web data improves language-conditioned manipulation. The current consensus is that visual and language pretraining helps (through better perception and instruction following), but game-playing pretraining likely does not transfer to real-world manipulation.

How much real robot data did Gato use?

Gato used approximately 600 real Sawyer block stacking demonstrations, supplemented by tens of thousands of simulated episodes. This is a very small number by modern standards — BridgeData V2 has 60K episodes, RT-1 used 130K. The key insight is that Gato's multi-task pretraining provides a strong sequential decision-making prior, so fewer robotics-specific demonstrations are needed. For building a Gato-style system today, 500-5,000 real demonstrations per task type is a reasonable starting point.

How does Gato differ from modern VLAs?

Gato was the pioneer of the 'tokenize everything into one sequence' paradigm, but modern VLAs have refined it significantly. RT-2 and OpenVLA use pretrained vision-language model backbones (PaLI-X, Prismatic) that provide much stronger visual and language understanding than training from scratch. They use dedicated action tokenization schemes rather than treating actions the same as text. And they train on orders of magnitude more robotics data. Gato proved the concept; VLAs made it practical.

Get Tokenized Robot Demonstration Data

Tell us about your multi-modal transformer architecture. We deliver robot demonstrations pre-tokenized for Gato-style models or as raw continuous data for custom tokenization pipelines.