Training Data for CrossFormer

CrossFormer is a cross-embodiment transformer policy from UC Berkeley and CMU that handles heterogeneous observation and action spaces through embodiment-specific tokenization layers. Pretrained on 900K+ episodes from the Open X-Embodiment dataset spanning 20+ robot platforms, CrossFormer can be fine-tuned to new embodiments with as few as 50 demonstrations. This page details CrossFormer's data requirements and how Claru provides multi-platform demonstration data.

Organization: UC Berkeley / CMU
Year: 2024

Input/Output Specification

Observation: 1-4 camera views at 224x224 RGB plus a variable-dimension proprioceptive state, per embodiment

Action: Variable continuous action dimensions per embodiment, predicted as 4-step diffusion action chunks

Language: Natural language instructions via a T5-base encoder (optional per embodiment)

Control frequency: Variable per embodiment (2-50 Hz); action scaling handled by the per-embodiment detokenizer

How Claru Data Integrates with CrossFormer

Claru provides cross-embodiment demonstration data from multiple robot platforms in RLDS format with OXE-standard metadata. Our collection spans WidowX 250, Franka Emika, UR5e, and ALOHA configurations, covering the most common embodiments in CrossFormer's pretraining distribution. For fine-tuning on supported embodiments, we deliver 100-500 demonstrations per task with matched camera configurations, image resolutions, and control frequencies. For new embodiment onboarding, we collect the diverse multi-task corpus (200-1,000 episodes across 5+ tasks) needed to train effective tokenizer/detokenizer layers. All data includes per-embodiment action normalization statistics and validated RLDS metadata specifying embodiment name, action dimension, and camera configuration.

What Is CrossFormer?

CrossFormer is a cross-embodiment robot policy introduced by Doshi, Walke, Mees, Dasari, and Levine at UC Berkeley and Carnegie Mellon University in 2024. It addresses a fundamental challenge in generalist robot learning: different robots have different numbers of cameras, different action spaces (some use 6-DoF end-effector control, others use 7-DoF joint positions, others use 2-DoF base velocity), and different observation modalities. Prior approaches either forced all robots into a lowest-common-denominator representation (losing information) or trained separate models per embodiment (losing cross-embodiment transfer).

CrossFormer's key innovation is a set of embodiment-specific tokenizer and detokenizer layers that sit between the robot's raw observation/action space and a shared transformer backbone. The tokenizer maps each robot's unique observations (variable numbers of camera views, proprioceptive dimensions, and language inputs) into a fixed set of tokens that the shared transformer can process. The detokenizer maps the transformer's output back into the robot's specific action space. The shared transformer learns manipulation knowledge that transfers across embodiments, while the tokenizer/detokenizer layers handle the embodiment-specific translation.

CrossFormer was pretrained on a large subset of the Open X-Embodiment (OXE) dataset — approximately 900K episodes spanning over 20 robot platforms including WidowX 250, Franka Emika, Google Robot, ALOHA, xArm, UR5, and others. After pretraining, CrossFormer can be fine-tuned to a new embodiment using only 50-200 demonstrations, and the fine-tuned policy outperforms policies trained from scratch on the same target data.

The model is open-source and builds on the Octo codebase, making it straightforward to integrate into existing OXE-compatible training pipelines. CrossFormer represents the current state of the art for cross-embodiment transfer in manipulation, outperforming both Octo and RT-2-X on several benchmark tasks when fine-tuned with limited target-embodiment data.

CrossFormer at a Glance

- 900K+ pretraining episodes (OXE subset)
- 20+ robot platforms in the pretraining data
- 50-200 demonstrations for effective fine-tuning
- ~130M model parameters
- 4-step action chunk prediction horizon
- RLDS as the primary data format

Input / Output Specification

| Parameter | Specification |
| --- | --- |
| Image observations | Variable: 1-4 camera views, resized to 224x224 RGB; tokenized by a per-embodiment ViT encoder |
| Proprioceptive state | Variable dimension per embodiment (e.g., 7-dim for WidowX, 14-dim for ALOHA); tokenized by learned linear projections |
| Action space | Variable dimension per embodiment; detokenized from shared transformer output via per-embodiment MLP heads |
| Action chunks | 4-step action chunks predicted per forward pass |
| Language conditioning | Natural language instructions encoded via T5-base; optional per embodiment |
| Control frequency | Variable per embodiment (typically 2-50 Hz); handled by per-embodiment action scaling |
| Data format | RLDS (TensorFlow Datasets) following OXE conventions |
| Supported embodiments | Any embodiment with RGB images and continuous actions; new embodiments require training tokenizer/detokenizer layers |
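The resize-then-crop preprocessing implied by the table (images stored larger, then cropped to 224x224 for the ViT encoder) can be sketched in a few lines. This is a minimal illustration assuming numpy `(H, W, 3)` uint8 arrays, not the actual CrossFormer data loader.

```python
import numpy as np

def random_crop(image: np.ndarray, crop_size: int = 224, rng=None) -> np.ndarray:
    """Randomly crop a (H, W, 3) image to (crop_size, crop_size, 3).

    Mirrors the augmentation described in the text: frames are stored at
    256x256 and randomly cropped to the 224x224 model input at train time.
    """
    rng = rng or np.random.default_rng(0)
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

frame = np.zeros((256, 256, 3), dtype=np.uint8)
cropped = random_crop(frame)
print(cropped.shape)  # (224, 224, 3)
```

At evaluation time a deterministic center crop is typically used instead, so that the policy sees a stable viewpoint.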

Architecture and Key Innovations

CrossFormer's architecture has three distinct components: embodiment-specific input tokenizers, a shared transformer backbone, and embodiment-specific action detokenizers. During both pretraining and fine-tuning, only the tokenizers and detokenizers for the active embodiment(s) are updated alongside the shared backbone. When fine-tuning on a new embodiment, new tokenizer and detokenizer layers are initialized while the shared backbone is loaded from the pretrained checkpoint.

The input tokenizer for each embodiment consists of a ViT-S/16 image encoder (one per camera view), a linear projection for proprioceptive state, and the shared T5-base language encoder. Each tokenizer maps its inputs into a fixed number of tokens (typically 16-64 tokens per observation) with the same embedding dimension as the shared transformer (768-dim). This fixed-size token representation is what enables heterogeneous inputs to be processed by the same transformer.
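The shape bookkeeping above is the essence of the design: heterogeneous inputs in, fixed-size token block out. The toy numpy sketch below substitutes a mean-pooled pixel embedding for the real ViT-S/16 encoder and uses random stand-in weights; only the shapes are meant to be faithful.

```python
import numpy as np

EMBED_DIM = 768  # shared transformer width

def tokenize_observation(views, proprio, proprio_proj, n_tokens=16):
    """Map a variable observation into a fixed (n_tokens, EMBED_DIM) block.

    Illustrative only: each camera view is stood in for by a mean-pooled
    pixel embedding rather than a ViT-S/16 encoding.
    """
    tokens = []
    for view in views:                       # variable number of cameras
        pooled = view.reshape(-1, 3).mean(axis=0)            # (3,)
        tokens.append(np.tile(pooled, EMBED_DIM // 3))       # (768,)
    tokens.append(proprio @ proprio_proj)    # learned linear projection
    tokens = np.stack(tokens)
    # Zero-pad up to the fixed per-observation token budget.
    pad = np.zeros((n_tokens - len(tokens), EMBED_DIM))
    return np.concatenate([tokens, pad])

rng = np.random.default_rng(0)
widowx = tokenize_observation(
    views=[rng.random((224, 224, 3))],                       # 1 camera
    proprio=rng.random(7),                                   # 7-dim state
    proprio_proj=rng.random((7, EMBED_DIM)))
aloha = tokenize_observation(
    views=[rng.random((224, 224, 3)) for _ in range(4)],     # 4 cameras
    proprio=rng.random(14),                                  # 14-dim state
    proprio_proj=rng.random((14, EMBED_DIM)))
print(widowx.shape, aloha.shape)  # both (16, 768) despite different inputs
```

Because both embodiments emit the same `(16, 768)` block, the shared transformer never needs to know which robot produced the observation.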

The shared transformer backbone is a standard decoder-only transformer with 12 layers, 12 attention heads, and 768-dim hidden states. It processes sequences of observation tokens with causal attention and outputs a set of action tokens. The action tokens are passed to the embodiment-specific detokenizer, which is a small MLP (2 layers, 256 hidden units) that maps the transformer's output to the target action dimension and applies per-embodiment action scaling.
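The detokenizer described above is small enough to write out directly. This sketch uses random stand-in weights for the learned parameters and identity normalization statistics; the shapes follow the text (768-dim tokens in, a 2-layer MLP with 256 hidden units, native action dimension out).

```python
import numpy as np

def detokenize(action_tokens, w1, b1, w2, b2, action_mean, action_std):
    """Per-embodiment detokenizer: a 2-layer MLP mapping 768-dim
    transformer outputs to the robot's action dimension, followed by
    un-normalization with per-embodiment action statistics."""
    h = np.maximum(0.0, action_tokens @ w1 + b1)   # ReLU hidden layer
    normalized = h @ w2 + b2                       # (chunk, action_dim)
    return normalized * action_std + action_mean   # back to native scale

rng = np.random.default_rng(0)
action_dim = 7                                     # e.g. a WidowX 250
tokens = rng.standard_normal((4, 768))             # one 4-step action chunk
actions = detokenize(
    tokens,
    w1=rng.standard_normal((768, 256)) * 0.02, b1=np.zeros(256),
    w2=rng.standard_normal((256, action_dim)) * 0.02, b2=np.zeros(action_dim),
    action_mean=np.zeros(action_dim), action_std=np.ones(action_dim))
print(actions.shape)  # (4, 7): one 7-dim action per chunk step
```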

A critical design choice is the action chunk prediction with a diffusion head, inherited from Octo. Rather than predicting a single action, CrossFormer predicts 4-step action chunks using DDPM denoising with 10 diffusion steps. This provides temporal consistency and handles multi-modal action distributions. The diffusion head operates in the detokenizer space, so it is embodiment-specific.
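The denoising procedure can be illustrated with a bare-bones DDPM sampling loop. The noise schedule and the placeholder noise predictor below are toys standing in for the learned, embodiment-specific diffusion head; only the overall structure (start from Gaussian noise, run 10 reverse steps, emit a 4-step chunk) matches the description above.

```python
import numpy as np

def ddpm_denoise(eps_model, action_dim=7, chunk_len=4, steps=10, rng=None):
    """Illustrative DDPM sampling loop for an action chunk.

    eps_model(x, t) predicts the noise at diffusion step t; a lambda
    stands in for the learned diffusion head. The linear beta schedule
    is a toy choice, not the paper's."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((chunk_len, action_dim))  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t)
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                     # no noise at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

chunk = ddpm_denoise(lambda x, t: 0.1 * x)
print(chunk.shape)  # (4, 7)
```

Sampling a whole chunk jointly, rather than one action at a time, is what gives the temporal consistency mentioned above.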

CrossFormer's pretraining uses a dataset mixture strategy that upweights high-quality datasets. BridgeData V2 and DROID receive higher sampling weights than smaller or lower-quality datasets. The mixture is balanced so that each embodiment contributes roughly proportionally to the gradient signal, preventing the largest datasets from dominating. This balancing is critical — without it, the shared backbone overfits to the distribution of the largest embodiment and transfers poorly to smaller ones.
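One simple way to get the flattening effect described above is to raise episode counts to an exponent below 1 before normalizing. This is a sketch of the idea only; CrossFormer's actual mixture uses hand-tuned per-dataset weights (upweighting e.g. BridgeData V2 and DROID) rather than a single exponent.

```python
def mixture_weights(episode_counts, temperature=0.5):
    """Balanced sampling sketch: count**temperature with temperature < 1
    flattens the distribution so large datasets still contribute more
    than small ones, but cannot dominate the gradient signal."""
    raw = {name: count ** temperature for name, count in episode_counts.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

counts = {"bridge_v2": 60_000, "droid": 76_000, "small_lab": 2_000}
weights = mixture_weights(counts)
# Raw largest:smallest ratio is 38x; after flattening it is about 6x.
print({name: round(w, 3) for name, w in weights.items()})
```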

Comparison with Related Cross-Embodiment Models

| Attribute | CrossFormer | Octo | RT-2-X | HPT |
| --- | --- | --- | --- | --- |
| Cross-embodiment mechanism | Per-embodiment tokenizers + shared trunk | Task-specific adapters + shared trunk | Unified action tokenization | Heterogeneous pre-training with stem/trunk/head |
| Pretraining data | 900K+ OXE episodes | 800K+ OXE episodes | ~1M mixed-embodiment episodes | 200K+ multi-embodiment episodes |
| Open-source | Yes | Yes | No | Yes |
| Fine-tuning data needed | 50-200 demonstrations | 25-200 demonstrations | 100+ demonstrations | 50-200 demonstrations |
| Action representation | Diffusion, 4-step chunks | Diffusion, 4-step chunks | Discretized tokens | Continuous, variable dimension |
| New embodiment support | Train new tokenizer/detokenizer only | New adapter heads | Requires retraining | New stem/head modules |

Training Data Requirements

CrossFormer's data requirements depend on whether you are contributing to pretraining, fine-tuning an existing CrossFormer checkpoint, or training tokenizer/detokenizer layers for a new embodiment. Each scenario has different volume and format requirements.

For pretraining contributions, data must be in RLDS format following OXE conventions. Each episode needs: RGB images from one or more cameras (resized to 256x256 before augmentation crops to 224x224), proprioceptive state as a float vector, continuous actions in the robot's native control space, and optionally a natural language instruction. The RLDS metadata must specify the embodiment name, action dimension, proprioceptive dimension, and number of camera views. The OXE dataset builder template provides the required schema.
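The step layout and metadata described above can be sketched as plain Python dictionaries. Field names here follow common OXE conventions but are illustrative; the authoritative schema comes from the OXE dataset builder template.

```python
import numpy as np

# One hypothetical step in the RLDS/OXE layout described above.
step = {
    "observation": {
        "image_0": np.zeros((256, 256, 3), dtype=np.uint8),  # main camera
        "proprio": np.zeros(7, dtype=np.float32),            # float state vector
    },
    "action": np.zeros(7, dtype=np.float32),  # native continuous control
    "language_instruction": "put the carrot on the plate",   # optional
    "is_terminal": False,
}

# Dataset-level metadata used to route episodes to the correct
# tokenizer/detokenizer pair (illustrative keys).
dataset_metadata = {
    "embodiment": "widowx_250",
    "action_dim": 7,
    "proprio_dim": 7,
    "num_cameras": 1,
}

# Internal consistency: declared dims must match the stored arrays.
assert step["action"].shape == (dataset_metadata["action_dim"],)
assert step["observation"]["proprio"].shape == (dataset_metadata["proprio_dim"],)
print("schema check passed")
```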

For fine-tuning CrossFormer to a new task on an already-supported embodiment (e.g., WidowX 250, Franka Emika), you need 50-200 demonstrations of the target task in RLDS format. The Doshi et al. paper shows that 100 demonstrations on a WidowX 250 achieve within 5% of the performance of 500 demonstrations when starting from the CrossFormer pretrained checkpoint. This is 2-3x more sample-efficient than Octo fine-tuning on the same tasks.

For adding a new embodiment not seen during pretraining, you need to train new tokenizer and detokenizer layers. This requires 200-1,000 demonstrations on the new embodiment, covering a variety of tasks (not just the target task). The more diverse the training tasks, the better the tokenizer/detokenizer layers learn to represent the new embodiment's observation and action space. Doshi et al. recommend at least 5 distinct tasks for tokenizer training.

Data quality considerations mirror those of Octo and BridgeData V2: demonstrations should be successful completions of the intended task, with consistent camera viewpoints within an embodiment, and actions in continuous control space (not discretized). Action normalization to zero-mean, unit-variance per dimension is handled by the training pipeline but requires accurate dataset statistics. Language instructions, if present, should use simple imperative sentences that match the style of the OXE pretraining data.
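The per-dimension normalization statistics mentioned above are straightforward to compute; what matters is that they are computed over the whole dataset and shipped with it. A minimal sketch, assuming episodes stored as dicts of numpy arrays:

```python
import numpy as np

def action_stats(episodes):
    """Compute per-dimension mean and std over all actions in a dataset:
    the statistics the training pipeline uses to normalize actions to
    zero mean and unit variance."""
    actions = np.concatenate([ep["actions"] for ep in episodes], axis=0)
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + 1e-8          # guard against zero variance
    return mean, std

rng = np.random.default_rng(0)
episodes = [{"actions": rng.normal(2.0, 3.0, size=(50, 7))} for _ in range(20)]
mean, std = action_stats(episodes)
normalized = (episodes[0]["actions"] - mean) / std
print(mean.shape, std.shape, normalized.shape)
```

Shipping stale or per-subset statistics is a common failure mode: if they do not reflect the full dataset, the detokenizer un-scales actions incorrectly at deployment.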

How Claru Data Integrates with CrossFormer

Claru provides cross-embodiment demonstration data from multiple robot platforms, delivered in RLDS format with the OXE-standard metadata that CrossFormer requires. Our current collection capabilities span WidowX 250, Franka Emika, Universal Robots UR5e, and ALOHA bimanual configurations — covering the most common embodiments in the CrossFormer pretraining distribution.

For teams fine-tuning CrossFormer on a supported embodiment, we deliver targeted task-specific datasets of 100-500 demonstrations per task. Each dataset includes the correct RLDS feature specification, per-embodiment action normalization statistics, and language annotations. We match the camera configurations, image resolutions, and control frequencies of the pretraining data for the target embodiment to minimize distributional shift.

For teams onboarding a new embodiment to CrossFormer, we collect the diverse multi-task demonstration corpus (200-1,000 episodes across 5+ tasks) needed to train effective tokenizer/detokenizer layers. We work with your team to define the task distribution and ensure sufficient coverage of the embodiment's workspace and manipulation capabilities.

Our quality pipeline validates cross-embodiment data compatibility: action dimensions match the declared embodiment specification, image resolutions are consistent within each embodiment, proprioceptive state vectors have the correct dimension and value ranges, and language annotations follow OXE conventions. This validation prevents silent data corruption that could degrade the shared transformer backbone during co-training.
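The kinds of checks described above reduce to shape comparisons against a declared embodiment spec. This is a minimal sketch with hypothetical spec keys, not our actual validation pipeline:

```python
import numpy as np

# Declared spec for a hypothetical embodiment (illustrative values).
SPEC = {"action_dim": 7, "proprio_dim": 7, "image_shape": (256, 256, 3)}

def validate_episode(episode, spec=SPEC):
    """Check every step's action, proprioceptive state, and image against
    the declared embodiment spec; return a list of human-readable errors
    (empty list means the episode passes)."""
    errors = []
    for i, step in enumerate(episode):
        if step["action"].shape != (spec["action_dim"],):
            errors.append(f"step {i}: bad action shape {step['action'].shape}")
        if step["proprio"].shape != (spec["proprio_dim"],):
            errors.append(f"step {i}: bad proprio shape {step['proprio'].shape}")
        if step["image"].shape != spec["image_shape"]:
            errors.append(f"step {i}: bad image shape {step['image'].shape}")
    return errors

good = [{"action": np.zeros(7), "proprio": np.zeros(7),
         "image": np.zeros((256, 256, 3), dtype=np.uint8)}]
bad = [{"action": np.zeros(6), "proprio": np.zeros(7),
        "image": np.zeros((256, 256, 3), dtype=np.uint8)}]
print(len(validate_episode(good)), len(validate_episode(bad)))  # 0 1
```

Running a check like this before training is cheap insurance: a single mis-shaped episode routed through the wrong detokenizer can silently corrupt gradients during co-training.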

Key References

  1. Doshi et al. "Scaling Cross-Embodied Learning: One Policy for Manipulation across Robots with Different Observations and Action Spaces." CoRL 2024.
  2. Octo Model Team. "Octo: An Open-Source Generalist Robot Policy." RSS 2024.
  3. Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024.
  4. Liang et al. "HPT: Scaling Heterogeneous Pre-Training for Robot Manipulation." arXiv:2409.20537, 2024.
  5. Walke et al. "BridgeData V2: A Dataset for Robot Learning at Scale." CoRL 2023.

Frequently Asked Questions

How does CrossFormer handle robots with different observation and action spaces?

CrossFormer uses per-embodiment tokenizer and detokenizer layers. Each embodiment's tokenizer maps its specific observation configuration (e.g., 1 camera + 7-dim proprioception for WidowX, or 4 cameras + 14-dim proprioception for ALOHA) into a fixed-size set of tokens. The shared transformer processes these tokens identically regardless of source embodiment. The detokenizer then maps transformer outputs back to the embodiment's specific action dimension. New embodiments require training new tokenizer/detokenizer layers, which takes 200-1,000 demonstrations.

Why does cross-embodiment pretraining help a robot that contributed no pretraining data?

CrossFormer's shared transformer backbone encodes manipulation knowledge that transfers across embodiments, such as how to approach an object, when to close a gripper, and how to handle visual occlusion. This means a new embodiment can leverage 900K+ episodes of cross-embodiment experience through the pretrained backbone, even though none of that data was collected on the target robot. In practice, this yields 15-30% higher success rates when fine-tuning with 50-200 demonstrations, compared to training from scratch on the same data.

Can CrossFormer be fine-tuned to a robot not seen during pretraining?

Yes, that is one of CrossFormer's primary design goals. You need to train new tokenizer and detokenizer layers for your embodiment, which requires 200-1,000 demonstrations across at least 5 different tasks. The shared transformer backbone is initialized from the pretrained checkpoint and fine-tuned alongside the new layers. This typically takes 4-8 hours on a single GPU and achieves significantly better performance than training from scratch.

How does CrossFormer differ from Octo?

Octo uses task-specific adapter heads that map to predefined action space categories (e.g., end-effector delta, joint position). CrossFormer replaces this with embodiment-specific tokenizer/detokenizer pairs that can handle any continuous action space without predefined categories. CrossFormer also uses a different pretraining mixture strategy that better balances across embodiments. On benchmark tasks, CrossFormer achieves 5-15% higher success rates than Octo when fine-tuning with limited data (50-200 demonstrations).

What data format does CrossFormer require?

CrossFormer uses the same RLDS (TensorFlow Datasets) format as Octo, following OXE conventions. The main difference is in metadata: CrossFormer datasets must specify the embodiment name (used to route to the correct tokenizer/detokenizer), action dimension, proprioceptive dimension, and camera view count. If your data is already in OXE-compatible RLDS format, it works with CrossFormer with minimal modification; you just need to ensure the embodiment metadata is correctly specified.

Get Multi-Robot Training Data for CrossFormer

Tell us about your robot platform and target tasks. We collect cross-embodiment demonstrations in RLDS format, ready for CrossFormer fine-tuning or new-embodiment onboarding.