Training Data for GR00T N1
A detailed breakdown of NVIDIA's GR00T N1 humanoid foundation model -- its dual-system architecture with Eagle-2 VLM and Diffusion Transformer, the heterogeneous data mixture it trains on, and how Claru provides the human video and robot demonstration data GR00T N1 requires.
Input/Output Specification
- Observations: Multi-view RGB images at configurable resolution, plus proprioceptive state (joint positions, velocities)
- Actions: Continuous joint-position targets via Diffusion Transformer with action flow matching; embodiment-specific encoders/decoders handle variable DoF
- Conditioning: Natural language instructions processed by the Eagle-2 VLM (SigLIP-2 + SmolLM2); also supports video demonstration conditioning
- Control rates: System 2 at 10 Hz (reasoning); System 1 at higher rates (motor control)
How Claru Data Integrates with GR00T N1
Claru provides data across all three layers of GR00T N1's training mixture. For real robot trajectories (Layer 1), we deliver teleoperated demonstrations on humanoid and bimanual platforms with multi-view RGB, full joint-state recordings at 50+ Hz, and natural language task labels in LeRobot-compatible HDF5 format. For human video (Layer 2), our catalog of 3M+ egocentric activity videos -- spanning kitchen tasks, tool use, assembly, and daily activities from head-mounted cameras -- provides the raw footage needed to train GR00T N1's VQ-VAE latent-action codebook. For synthetic data (Layer 3), we provide curated simulation datasets from NVIDIA Isaac Sim with ground-truth actions and domain-randomized visuals. All deliverables include camera calibration, trajectory success labels, and quality metadata. We support fine-tuning workflows for new humanoid embodiments with 500-20,000 demonstration packages tailored to your platform's DoF count and task families.
What Is GR00T N1?
GR00T N1 is an open foundation model for generalist humanoid robots, developed by NVIDIA and published in March 2025 (arXiv 2503.14734). It is a Vision-Language-Action (VLA) model with a dual-system architecture inspired by human cognitive processing: a System 2 reasoning module (a pretrained Vision-Language Model) handles task understanding and environmental interpretation, while a System 1 action module (a Diffusion Transformer) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end.
The publicly released GR00T-N1-2B model contains 2.2 billion parameters in total, with 1.34 billion in the VLM backbone (NVIDIA Eagle-2, fine-tuned from SmolLM2 and SigLIP-2). The model supports cross-embodiment deployment from tabletop robot arms to full-size humanoid robots, and was validated on platforms including the Fourier GR-1 humanoid, ALOHA bimanual arms, and single-arm manipulation setups. Pretraining consumed approximately 50,000 NVIDIA H100 GPU hours using up to 1,024 GPUs.
A defining feature of GR00T N1 is its ability to learn from heterogeneous data sources -- not just teleoperated robot trajectories, but also human egocentric videos and synthetically generated trajectories. For data sources that lack action labels (like human videos), GR00T N1 employs a learned latent-action codebook and trained inverse dynamics models (IDMs) to infer pseudo-actions, enabling the model to extract manipulation knowledge from video datasets that were never collected with robots.
GR00T N1 at a Glance
Input / Output Specification
| Parameter | Specification |
|---|---|
| System 2 (Reasoning) | Eagle-2 VLM (SigLIP-2 image encoder + SmolLM2 LLM) processes multi-view RGB images and language instructions at 10 Hz |
| System 1 (Action) | Diffusion Transformer with action flow-matching cross-attends to VLM output tokens and generates motor actions via embodiment-specific encoders/decoders |
| Observation Format | Multi-view RGB images at configurable resolution, plus proprioceptive state (joint positions, velocities) |
| Action Format | Continuous joint-position targets generated via flow matching, with embodiment-specific action decoders handling variable DoF counts |
| Language Conditioning | Natural language instructions processed by the Eagle-2 VLM; also supports video demonstration conditioning |
| Control Frequency | System 2 runs at 10 Hz for task reasoning; System 1 Diffusion Transformer can output actions at higher rates for motor control |
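To make the flow-matching generation step concrete, here is a minimal 1-D sketch of sampling an action chunk by Euler-integrating a velocity field from Gaussian noise (t=0) to an action (t=1). In GR00T N1 the velocity field is predicted by the Diffusion Transformer conditioned on VLM tokens and proprioception; the closed-form "oracle" field along the straight-line path used below is a stand-in, and `joint_target` is a made-up value.

```python
import random

def oracle_velocity(x, t, target):
    """Velocity of the straight-line (optimal-transport) path toward `target`.
    In GR00T N1 this field is predicted by the Diffusion Transformer;
    here it is closed-form so the sketch stays self-contained."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

def sample_action(target, steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

joint_target = [0.2, -0.5, 1.1]   # hypothetical joint-position target
action = sample_action(joint_target)
```

With the straight-line field, a few Euler steps land exactly on the target; a learned field trades that exactness for generalization across observations and instructions.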
Architecture and Key Innovations
GR00T N1's dual-system design is its most distinctive architectural choice. System 2 is the NVIDIA Eagle-2 VLM, which combines a SigLIP-2 image encoder with a SmolLM2 language model. This module processes the robot's camera views and natural language instructions to produce a rich contextual representation of the current task state. Running at 10 Hz on an NVIDIA L40 GPU, it provides the high-level reasoning about what the robot should do and why.
System 1 is a Diffusion Transformer trained with action flow matching. It cross-attends to the VLM's output token sequence and generates continuous motor actions through iterative denoising. The key design choice here is the use of embodiment-specific encoders and decoders: the encoder maps the robot's proprioceptive state (which varies in dimensionality across embodiments) into a fixed-size latent representation, and the decoder maps the Diffusion Transformer's output back to the robot's native action space. This allows a single shared Diffusion Transformer trunk to serve multiple robot morphologies.
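The encoder/decoder interface can be sketched in a few lines. This is a deliberately simplified stand-in: GR00T N1 uses learned per-embodiment MLPs, whereas the version below just zero-pads to a fixed trunk width, and the embodiment names and DoF counts are illustrative assumptions.

```python
LATENT_DIM = 8  # shared trunk width (illustrative; the real model is far wider)

EMBODIMENTS = {            # hypothetical DoF counts per platform
    "gr1_arm": 7,
    "aloha_bimanual": 6,
}

def encode_state(embodiment, joint_state):
    """Map a variable-DoF proprioceptive state into the fixed-size trunk input.
    GR00T N1 uses a learned per-embodiment MLP; zero-padding is a stand-in
    that keeps the interface identical."""
    dof = EMBODIMENTS[embodiment]
    assert len(joint_state) == dof
    return list(joint_state) + [0.0] * (LATENT_DIM - dof)

def decode_action(embodiment, trunk_output):
    """Map the fixed-size trunk output back to the robot's native action space."""
    return trunk_output[:EMBODIMENTS[embodiment]]
```

The key property this preserves is that the shared trunk only ever sees `LATENT_DIM`-wide vectors, so one set of trunk weights serves every registered embodiment.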
The pseudo-action mechanism for training on actionless data is a critical innovation. For human egocentric videos, GR00T N1 trains a VQ-VAE model that extracts features from consecutive video frames and maps them to a discrete latent-action codebook. For synthetically generated video trajectories, a separate Diffusion Transformer-based inverse dynamics model (IDM) is trained to predict actions from consecutive observation pairs. Both mechanisms produce pseudo-action labels that are used as flow-matching targets during training, treating each data source as a separate 'embodiment' with its own encoder/decoder pair.
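The codebook lookup at the heart of the VQ-VAE step reduces to a nearest-neighbor search. A minimal sketch, with a hand-written toy codebook standing in for the one GR00T N1 learns from video, and the frame-pair embedding assumed to come from the VQ-VAE encoder:

```python
def quantize(embedding, codebook):
    """Map a continuous frame-pair embedding to the index of the nearest
    codebook entry (squared Euclidean distance), as in VQ-VAE lookup."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(embedding, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy 3-entry latent-action codebook
pseudo_action = quantize([0.9, 0.1], codebook)   # index of the nearest entry
```

The returned index is the discrete latent action; during training it serves as the flow-matching target for the human-video "embodiment".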
The end-to-end joint training of System 1 and System 2 ensures tight coupling between reasoning and action. Unlike pipeline approaches where a VLM first produces a plan and a separate controller executes it, GR00T N1's gradient flows through both modules during training, allowing the VLM to learn representations that are directly useful for action generation and the Diffusion Transformer to leverage the VLM's semantic understanding.
Comparison with Related Models
How GR00T N1 compares to alternative humanoid and generalist robot foundation models.
| Dimension | GR00T N1 | OpenVLA | HPT | pi0 |
|---|---|---|---|---|
| Architecture | Dual-system: VLM + Diffusion Transformer | Single VLM with action token head | Modular stems + shared transformer trunk | VLM + flow matching action head |
| Parameters | 2.2B (1.34B VLM) | 7B | Up to 1B | Proprietary (VLM-scale) |
| Target embodiments | Humanoids, bimanual arms, single arms | Single-arm manipulators | Cross-embodiment (any robot) | Cross-embodiment (arms) |
| Actionless data usage | Yes (latent codebook + IDM pseudo-actions) | No (requires action labels) | Partial (human video via separate stem) | No (requires action labels) |
| Open weights | Yes (Hugging Face) | Yes | Yes (GitHub) | No |
Training Data Requirements
GR00T N1 trains on a heterogeneous mixture spanning three data layers. Layer 1 is real robot trajectories with ground-truth action labels -- teleoperated demonstrations on platforms like the Fourier GR-1 humanoid and ALOHA bimanual arms. These trajectories provide multi-view RGB frames, joint-position recordings, and end-effector poses at 10-50 Hz, paired with natural language task descriptions. The Open X-Embodiment dataset is also used for cross-embodiment pretraining.
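Because Layer 1 quality depends on consistent sampling, a simple sanity check when ingesting teleoperated trajectories is to estimate the recording rate from the timestamp stream and confirm it falls in the expected 10-50 Hz range. A small helper sketch (the function name and uniform-spacing assumption are ours, not part of GR00T N1's pipeline):

```python
def estimate_rate_hz(timestamps):
    """Estimate the sampling rate of a recorded joint-state stream from its
    timestamps (in seconds), assuming roughly uniform spacing."""
    if len(timestamps) < 2:
        raise ValueError("need at least two samples to estimate a rate")
    span = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / span

# 101 samples spaced 20 ms apart -> about 50 Hz
rate = estimate_rate_hz([i * 0.02 for i in range(101)])
```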
Layer 2 is human egocentric video without action labels. GR00T N1 uses large-scale human activity datasets where a person performs manipulation tasks filmed from a head-mounted or chest-mounted camera. To extract training signal from these videos, a VQ-VAE model is trained on consecutive frame pairs to learn a latent-action codebook. The resulting discrete latent actions serve as pseudo-action targets during flow-matching training, with the human video treated as a separate 'embodiment' with its own encoder/decoder.
Layer 3 is synthetically generated data -- video trajectories produced by neural video generation models or physics simulators. For neural-generated videos, a Diffusion Transformer-based inverse dynamics model (IDM) is trained to predict actions from observation pairs. For simulator data (e.g., from NVIDIA Isaac Sim), ground-truth actions are available directly. Synthetic data is particularly valuable for scaling the diversity of tasks and environments beyond what is practical to collect physically.
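The IDM's supervision is straightforward to construct from any trajectory that does have ground-truth actions: pair each consecutive observation tuple with the action taken between them. A sketch under that framing (the function name and data layout are illustrative assumptions):

```python
def idm_training_pairs(observations, actions):
    """Build ((o_t, o_{t+1}) -> a_t) supervision pairs for an inverse dynamics
    model from a trajectory with ground-truth actions (e.g. simulator rollouts).
    Once trained, the IDM can pseudo-label action-free generated video."""
    assert len(actions) == len(observations) - 1, "one action per transition"
    return [((observations[t], observations[t + 1]), actions[t])
            for t in range(len(actions))]

pairs = idm_training_pairs(["o0", "o1", "o2"], ["a0", "a1"])
```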
For teams fine-tuning GR00T N1 on a new humanoid platform or task set, NVIDIA's documentation recommends starting with 500-2,000 high-quality teleoperated demonstrations per task family on the target embodiment. For new embodiment integration where the model has no prior exposure to the robot's morphology, 5,000-20,000 demonstrations covering locomotion, manipulation, and whole-body coordination are recommended. The data should include multi-view RGB from at least 2 cameras, full joint-state recordings at 50+ Hz, and natural language task labels.
How Claru Data Integrates with GR00T N1
Claru provides data for all three layers of GR00T N1's training mixture. For Layer 1 (real robot trajectories), we collect teleoperated demonstrations on humanoid and bimanual platforms with multi-view RGB cameras, full joint-state recordings at 50+ Hz, and calibrated camera intrinsics/extrinsics. Our data includes natural language task descriptions and trajectory-level success labels, matching the format GR00T N1's training pipeline expects.
For Layer 2 (human video), Claru's catalog of 3M+ egocentric human activity videos provides a rich source of manipulation knowledge. Our egocentric footage spans kitchen tasks, tool use, object rearrangement, personal care, and industrial assembly -- all captured from head-mounted or chest-mounted cameras with verified temporal segmentation. This data can be used to train the VQ-VAE latent-action codebook that GR00T N1 uses to extract pseudo-actions from human video.
For Layer 3 (synthetic data), Claru can provide curated simulation datasets generated in NVIDIA Isaac Sim and other physics engines, with ground-truth action labels and domain-randomized visual conditions. We also provide the paired observation sequences needed to train the inverse dynamics model used for pseudo-action labeling of neural-generated video. All data is delivered in LeRobot-compatible HDF5 format with the metadata schema GR00T N1's open-source training code expects.
Key References
- [1] Bjorck et al. “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.” arXiv:2503.14734, 2025.
- [2] Black et al. “pi0: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164, 2024.
- [3] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246, 2024.
- [4] O'Neill et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA 2024.
- [5] Lipman et al. “Flow Matching for Generative Modeling.” ICLR 2023.
- [6] Shi et al. “Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models.” arXiv:2501.14818, 2025.
Frequently Asked Questions
Is GR00T N1 open source?
Yes. NVIDIA released GR00T-N1-2B as an open foundation model in March 2025, with model weights available on Hugging Face and training/fine-tuning code on GitHub (NVIDIA/Isaac-GR00T). The model can be fine-tuned for specific humanoid platforms and tasks using standard PyTorch workflows. Subsequent versions (N1.5, N1.6) have also been released with improved performance.
Which robot platforms does GR00T N1 support?
GR00T N1 uses embodiment-specific encoders and decoders that can adapt to any robot morphology. The published model was validated on the Fourier GR-1 humanoid (for bimanual manipulation), ALOHA bimanual arms, and standard single-arm setups. For new platforms, fine-tuning with embodiment-specific demonstrations is required. The encoder/decoder design means the shared VLM and Diffusion Transformer trunk transfer across embodiments.
How does GR00T N1 learn from human videos without action labels?
GR00T N1 trains a VQ-VAE model on consecutive video frame pairs to learn a discrete latent-action codebook. For each pair of consecutive frames in a human video, the VQ-VAE encoder extracts a continuous embedding that is mapped to the nearest codebook entry. These discrete latent actions serve as pseudo-action targets during flow-matching training, with the human video treated as a separate embodiment with its own encoder/decoder pair. This allows the model to extract manipulation knowledge from millions of human activity videos.
How much compute does GR00T N1 require to train and fine-tune?
The published GR00T-N1-2B model required approximately 50,000 NVIDIA H100 GPU hours for pretraining, using up to 1,024 GPUs in a single training run. Fine-tuning on a new embodiment or task set is significantly less expensive -- typically feasible on 8-64 GPUs over a few days, depending on the dataset size and desired level of adaptation.
What format does Claru deliver GR00T N1 training data in?
Claru delivers datasets in LeRobot-compatible HDF5 format matching GR00T N1's open-source training code expectations. Each dataset includes multi-view RGB images, full joint-state recordings at 50+ Hz, natural language task descriptions, and camera calibration files. For human video data, we provide egocentric footage with temporal segmentation suitable for VQ-VAE latent-action codebook training. All data includes trajectory-level success labels and quality metadata.
Get GR00T N1-Ready Training Data
Tell us about your GR00T N1 project -- target humanoid platform, task families, and deployment environment -- and we will deliver multi-modal datasets formatted for GR00T N1's three-layer training pipeline.