Training Data for HPT (Heterogeneous Pre-trained Transformers)
A detailed breakdown of HPT's modular stem-trunk-head architecture, its pretraining across 52 heterogeneous datasets with 200K+ trajectories, scaling behavior up to 1B parameters, and how Claru provides multi-modal robot data for HPT pretraining and fine-tuning.
Input/Output Specification
- Inputs: any combination of RGB, depth, point clouds, and proprioception -- tokenized to 32 fixed tokens via embodiment-specific stems
- Outputs: variable action spaces (joint positions, velocities, end-effector deltas) decoded by embodiment-specific heads
- Conditioning: task embeddings via learned tokens (language can be added through the vision stem or a dedicated stem)
- Control frequency: variable per embodiment (matches the source dataset's control rate)
How Claru Data Integrates with HPT
Claru provides multi-modal robot data spanning the diversity HPT benefits from most. Our catalog includes data from 5+ robot platforms (Franka Panda, UR5e, xArm, Unitree, custom humanoids) with multiple sensor configurations (single and multi-view RGB at various resolutions, depth cameras, wrist-mounted force/torque sensors). Each dataset includes time-synchronized visual observations, complete proprioceptive state recordings (all DoFs, positions, velocities, and optionally torques), and action labels at the robot's native control rate. For HPT pretraining, we can contribute new heterogeneous datasets that add unique embodiment-sensor combinations to the mixture, directly improving the pretrained trunk's representational breadth. For fine-tuning on a new platform, we deliver 50-2,000 demonstrations with the metadata (sensor specs, URDF, action space definition) needed to configure HPT's new stems and heads. All data is in HDF5 format compatible with HPT's GitHub codebase.
What Is HPT?
HPT (Heterogeneous Pre-trained Transformers) is a scalable framework for pretraining robot policies across heterogeneous data sources, developed by Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He (MIT CSAIL and Meta FAIR). Published as a NeurIPS 2024 Spotlight paper (arXiv 2409.20537), HPT addresses a fundamental challenge in robot foundation models: how to pretrain a single policy network on data from dozens of different robot embodiments, sensor configurations, and task domains without requiring a unified data format.
The key insight is a modular architecture split into three components: embodiment-specific stems that tokenize heterogeneous inputs into a fixed-length token sequence, a shared transformer trunk that processes these tokens into a universal representation, and task/embodiment-specific heads that decode actions for the target robot. By standardizing the interface between stems and trunk (a fixed number of tokens, typically 16 for vision and 16 for proprioception), HPT can pretrain its trunk on data from 52 different datasets spanning simulation, real-world teleoperation, human demonstration video, and deployed robot logs -- totaling over 200,000 trajectories.
HPT demonstrates clear scaling behavior: increasing the trunk size up to 1 billion parameters and the number of pretraining datasets up to 52 consistently improves downstream fine-tuning performance. Pretrained HPT policies outperform baselines by over 20% on unseen tasks in multiple simulation benchmarks and real-world evaluations, including dynamic contact-rich manipulation and long-horizon assembly tasks.
HPT at a Glance
Input / Output Specification
| Parameter | Specification |
|---|---|
| Vision Stem | Pre-trained image encoders (ViT, ResNet) map camera views to features; an attention mechanism compresses to 16 fixed tokens |
| Proprioception Stem | MLP maps variable-dimension proprioceptive state to a feature vector; 16 learnable query tokens attend to this feature |
| Trunk | Shared transformer (scalable to 1B parameters) processes concatenated 32-token sequence into universal representations |
| Action Head | Embodiment-specific decoder maps trunk output to the target action space (variable dimensionality per robot) |
| Language Conditioning | Task embeddings (not natural language) via learned task tokens; language can be added through the vision stem |
| Control Frequency | Variable per embodiment (matches the control rate of each source dataset) |
Architecture and Key Innovations
HPT's architecture is deliberately modular to accommodate the extreme heterogeneity of robot data. The vision stem takes camera images from any number of views at any resolution, processes them through a pre-trained image encoder (options include ViT-B, ViT-L, or ResNet-50), and uses a cross-attention mechanism with 16 learnable query tokens to compress the variable-length visual features into a fixed 16-token sequence. This compression is critical -- it means the trunk never needs to handle variable-length inputs regardless of how many cameras or what resolution the source data uses.
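The compression step can be sketched as single-head cross-attention in NumPy. This is an illustrative simplification (one head, no learned projections or positional embeddings, unlike the real stem); the point it demonstrates is that the output length is fixed at 16 tokens no matter how many encoder features come in:

```python
import numpy as np

def compress_to_tokens(features, queries):
    """Cross-attend fixed learnable queries over variable-length features.

    features: (n, d) image-encoder outputs -- n varies with views/resolution
    queries:  (16, d) learnable query tokens
    returns:  (16, d) fixed-length token sequence for the trunk
    """
    d = queries.shape[1]
    scores = queries @ features.T / np.sqrt(d)       # (16, n)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over the n features
    return weights @ features                        # (16, d)

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 64))
# e.g. 197 ViT patch features from one view vs. 394 from two views --
# the trunk-facing output shape is identical in both cases
one_view = compress_to_tokens(rng.normal(size=(197, 64)), queries)
two_view = compress_to_tokens(rng.normal(size=(394, 64)), queries)
assert one_view.shape == two_view.shape == (16, 64)
```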
The proprioception stem handles the even more heterogeneous proprioceptive inputs. Different robots have different numbers of joints (from 6-DoF arms to 33-DoF humanoids), different state representations (joint positions, velocities, torques, end-effector poses), and different coordinate frames. HPT's proprioception stem uses a learnable MLP to map any proprioceptive vector to a fixed-dimension feature, then applies the same 16-learnable-query cross-attention mechanism to produce 16 proprioception tokens. This design means a single pretrained trunk can process data from a 6-DoF Franka arm and a 33-DoF humanoid using only different stem weights.
The shared transformer trunk is the core pretrained component. It processes the concatenated 32-token sequence (16 vision + 16 proprioception) through standard transformer blocks. The trunk weights are shared across all datasets during pretraining and transferred to new embodiments during fine-tuning. HPT's experiments show that scaling the trunk from small (tens of millions of parameters) to large (1 billion parameters) consistently improves downstream performance, indicating that the trunk is learning genuinely useful cross-embodiment representations rather than memorizing dataset-specific patterns.
The action head is an embodiment-specific decoder that maps the trunk's output representation to the target robot's action space. During pretraining, each dataset has its own head. During fine-tuning on a new embodiment, a new head is initialized while the pretrained trunk is transferred (optionally frozen or fine-tuned at a lower learning rate). The modular head design means HPT never needs to reconcile different action spaces into a single representation -- each head simply decodes to whatever the target robot expects (joint positions, velocities, end-effector deltas, etc.).
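The transfer recipe above can be sketched in PyTorch. The layer count, width, and the plain linear head are placeholder choices for this sketch, not HPT's published configuration:

```python
import torch
import torch.nn as nn

d_model = 256  # illustrative trunk width

# Shared trunk: standard transformer blocks over the 32-token sequence
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=4,
)

# Fine-tuning on a new embodiment: freeze the pretrained trunk and
# train only a freshly initialized head for the target action space
for p in trunk.parameters():
    p.requires_grad = False

new_head = nn.Linear(32 * d_model, 7)  # e.g. 7-DoF joint-position commands

tokens = torch.randn(2, 32, d_model)   # batch of 16 vision + 16 proprio tokens
rep = trunk(tokens).flatten(1)         # (2, 32 * d_model)
actions = new_head(rep)                # (2, 7)
assert actions.shape == (2, 7)
```

Unfreezing the trunk at a lower learning rate is the other option the text mentions; in this sketch that would mean leaving `requires_grad` on and putting `trunk.parameters()` in a separate optimizer parameter group.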
Comparison with Related Models
How HPT compares to alternative cross-embodiment robot pretraining approaches.
| Dimension | HPT | Octo | OpenVLA | GR00T N1 |
|---|---|---|---|---|
| Architecture | Modular stems + shared trunk + heads | Transformer with diffusion head | VLM with action token head | VLM + Diffusion Transformer |
| Pretraining datasets | 52 datasets (200K+ trajectories) | Open X-Embodiment (800K episodes) | Open X-Embodiment (970K episodes) | Real robot + human video + synthetic |
| Max parameters | 1B (trunk only) | 93M | 7B | 2.2B |
| Heterogeneous inputs | Yes (any sensor via stems) | Limited (fixed observation format) | RGB only | Yes (embodiment-specific encoders) |
| Language conditioning | Task embeddings (optional language) | Natural language | Natural language | Natural language + video |
Training Data Requirements
HPT's pretraining data mixture is one of the most diverse in robot learning. The published model was pretrained on 52 datasets totaling over 200,000 trajectories, organized into four categories: real-world teleoperation data (human operators controlling robots via haptic devices, VR controllers, or kinesthetic teaching), human demonstration video (third-person and egocentric recordings of humans performing tasks), simulation data (trajectories from physics simulators like MuJoCo, Isaac Gym, and RoboCasa), and deployed robot logs (data from autonomous robot policies running in production).
Each dataset in the mixture has its own observation format (different camera configurations, resolutions, and proprioceptive state dimensions) and action space (different DoF counts, coordinate frames, and control modes). HPT handles this heterogeneity through its modular stem design -- each dataset gets its own stem configuration that tokenizes its specific inputs into the shared 32-token format. The trunk then processes all datasets uniformly.
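A hypothetical per-dataset configuration makes the pattern concrete; the key names below are invented for this sketch and do not match the HPT codebase's actual schema:

```python
# Hypothetical per-dataset stem configuration (illustrative key names)
dataset_configs = {
    "franka_teleop": {
        "cameras": [{"view": "front", "resolution": (224, 224)},
                    {"view": "wrist", "resolution": (128, 128)}],
        "proprio_dim": 14,                                 # 7 positions + 7 velocities
        "action_space": {"type": "joint_position", "dim": 7},
        "control_hz": 10,
    },
    "humanoid_logs": {
        "cameras": [{"view": "head", "resolution": (320, 240)}],
        "proprio_dim": 66,                                 # 33 positions + 33 velocities
        "action_space": {"type": "joint_velocity", "dim": 33},
        "control_hz": 50,
    },
}

# However different the per-dataset inputs are, every stem emits the same
# 16 vision + 16 proprioception tokens, so the trunk config never changes
TRUNK_TOKENS = 16 + 16
for name, cfg in dataset_configs.items():
    assert cfg["proprio_dim"] > 0 and cfg["action_space"]["dim"] > 0
```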
For teams fine-tuning HPT on a new embodiment or task, the key data requirement is providing enough demonstrations to train the new stem and head while leveraging the pretrained trunk. HPT's experiments show that the pretrained trunk improves fine-tuning performance by over 20% compared to training from scratch, even on tasks and embodiments not seen during pretraining. For a new single-arm manipulation task, 50-200 demonstrations are typically sufficient. For a new embodiment with significantly different morphology, 500-2,000 demonstrations covering diverse tasks provide enough signal to train effective new stems and heads.
Data quality requirements for HPT pretraining favor diversity over volume for any single dataset. The scaling experiments show that adding more datasets (even small ones with only hundreds of trajectories) consistently improves downstream performance, while simply adding more trajectories from existing datasets shows diminishing returns. This means HPT benefits most from broad coverage across robot platforms, environments, and task types rather than deep coverage of a narrow domain.
How Claru Data Integrates with HPT
Claru provides multi-modal robot data that directly supports HPT's heterogeneous pretraining paradigm. Our datasets span multiple robot platforms (Franka Panda, UR5e, xArm, Unitree), multiple sensor configurations (single-view and multi-view RGB, depth, point clouds, force/torque), and multiple control modes (joint positions, end-effector deltas, joint velocities). This diversity is exactly what HPT's pretraining benefits from most -- each new dataset from a different platform and sensor configuration adds a new stem to the mixture and improves the shared trunk's representations.
For HPT fine-tuning, Claru delivers demonstration datasets with the specific observation and action formats your target embodiment uses. Each dataset includes synchronized visual observations (RGB and optionally depth/point cloud), proprioceptive state recordings (joint positions, velocities, and optionally torques), and action labels at the embodiment's native control rate. We provide the metadata (sensor specifications, robot URDF, action space definition) needed to configure HPT's stem and head for your specific robot.
HPT's stem architecture requires that the vision and proprioception inputs are time-synchronized and that the proprioceptive state includes all degrees of freedom the robot uses for control. Claru's data collection pipeline ensures sub-millisecond synchronization between camera frames and proprioceptive state recordings, and our quality validation checks verify that every trajectory has complete, gap-free proprioceptive coverage. All data is delivered in HDF5 format compatible with HPT's open-source training code on GitHub.
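As a sketch, a single trajectory file might be laid out as follows. The group and dataset names here are hypothetical; the authoritative schema is whatever the loader in your version of the HPT codebase expects:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical per-trajectory HDF5 layout (illustrative names and shapes)
path = os.path.join(tempfile.mkdtemp(), "demo_000.hdf5")
T = 120  # timesteps at the robot's native control rate (e.g. 12 s at 10 Hz)
with h5py.File(path, "w") as f:
    f.create_dataset("observations/rgb_front",
                     data=np.zeros((T, 224, 224, 3), dtype=np.uint8))
    f.create_dataset("observations/proprio",
                     data=np.zeros((T, 14), dtype=np.float32))  # 7 pos + 7 vel
    f.create_dataset("actions",
                     data=np.zeros((T, 7), dtype=np.float32))   # joint positions
    f.attrs["control_hz"] = 10
    f.attrs["robot"] = "franka_panda"

# Time-synchronization invariant: one entry per stream per timestep
with h5py.File(path, "r") as f:
    n = {k: f[k].shape[0]
         for k in ["observations/rgb_front", "observations/proprio", "actions"]}
    assert len(set(n.values())) == 1
```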
Key References
- [1] Wang, L., Chen, X., Zhao, J., & He, K. “Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers.” NeurIPS 2024 (Spotlight). arXiv:2409.20537.
- [2] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” arXiv:2405.12213, 2024.
- [3] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246, 2024.
- [4] O'Neill et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA 2024.
- [5] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
Frequently Asked Questions
How does HPT handle inputs from robots with different sensor configurations?
HPT uses embodiment-specific stems that tokenize heterogeneous inputs into a fixed-length token sequence. The vision stem uses a pre-trained image encoder plus cross-attention with 16 learnable query tokens to handle any number of camera views at any resolution. The proprioception stem uses an MLP plus cross-attention with 16 query tokens to handle any proprioceptive state dimension. This produces a uniform 32-token input for the shared transformer trunk, regardless of the source robot's sensor configuration.
What scaling behavior does HPT demonstrate?
HPT shows consistent improvements along two scaling axes. First, increasing the trunk size from small to 1B parameters improves downstream fine-tuning performance. Second, increasing the number of pretraining datasets from a handful to 52 also improves performance, with each additional dataset contributing marginal gains even when it contains only hundreds of trajectories. The scaling curves have not yet saturated, suggesting that larger trunks and more diverse data mixtures will continue to improve results.
Does HPT support language conditioning?
HPT's published implementation uses learned task embeddings rather than natural language conditioning. However, the architecture is flexible -- language instructions can be integrated through the vision stem (e.g., by rendering text on the image or using a CLIP-style encoder) or by adding a dedicated language stem that produces additional tokens for the trunk. The modular design means language conditioning can be added without changing the trunk architecture.
How many demonstrations are needed to fine-tune HPT?
HPT's pretrained trunk provides strong initialization: fine-tuned policies outperform training from scratch by over 20%, even on unseen tasks and embodiments. For a new single-arm manipulation task on a supported embodiment, 50-200 demonstrations are typically sufficient. For a new embodiment with different morphology, 500-2,000 demonstrations covering diverse tasks are recommended to train the new stem and head while leveraging the pretrained trunk.
In what format does Claru deliver HPT-ready data?
Claru delivers demonstration datasets in HDF5 format compatible with HPT's open-source training code (github.com/liruiw/HPT). Each dataset includes synchronized RGB images (and optionally depth/point clouds), proprioceptive state recordings with all DoFs at the robot's native control rate, and action labels. We also provide the sensor metadata, robot URDF, and action space specification needed to configure HPT's embodiment-specific stems and heads for your platform.
Get HPT-Ready Multi-Modal Robot Data
Tell us about your HPT project -- target robot platform, sensor configuration, and task domains -- and we will deliver diverse, multi-modal demonstration datasets optimized for HPT's heterogeneous pretraining pipeline.