Last updated: March 2026
VLA Training Data: The Complete Guide (2026)
Vision-Language-Action models are redefining how robots learn. This guide covers every VLA architecture that matters, the datasets they train on, the gaps in publicly available data, and how teams building production robotics systems source the training data their models need.
What Are Vision-Language-Action Models?
A Vision-Language-Action (VLA) model is a multimodal foundation model that takes visual observations and natural language instructions as input and outputs physical robot actions. Unlike a standard vision-language model (VLM) such as GPT-4o or PaliGemma that outputs text, a VLA outputs continuous motor commands: joint positions, end-effector poses, and gripper states that directly control hardware.
The core idea is simple: take the rich perceptual and reasoning capabilities of large pretrained VLMs and extend them to produce actions. During training, the model sees triplets of (image, language_instruction, action_trajectory) and learns to map what the robot sees and the task description to the physical movements needed to complete it.
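Concretely, a single training example can be pictured as a record pairing one frame with its instruction and one action step. A minimal sketch; the field names and the 7-D action layout are illustrative, not taken from any specific framework:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class VLATrainingSample:
    """One (image, language_instruction, action) triplet.

    Field names are illustrative, not tied to any specific VLA framework.
    """
    image: np.ndarray        # H x W x 3 RGB frame from the robot's camera
    instruction: str         # natural language task description
    action: np.ndarray       # e.g. 7-D: xyz delta, roll/pitch/yaw delta, gripper

sample = VLATrainingSample(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    instruction="pick up the red cup and place it on the shelf",
    action=np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0], dtype=np.float32),
)
```

A full trajectory is then just a time-ordered sequence of such samples sharing one instruction.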
This architecture has driven a rapid shift in robotics research. At ICLR 2026, there were 164 VLA-related submissions, covering discrete diffusion VLAs, reasoning-augmented models, and novel benchmark designs. VLAs are now the dominant architecture for generalist robot manipulation.
But every VLA model is only as good as its training data. The model architecture is increasingly commoditized — what separates production-grade systems from demo videos is the quality, diversity, and scale of the physical AI training data behind them.
Major VLA Architectures and Their Data Needs
Four VLA architectures define the current landscape. Each has distinct data requirements shaped by its design choices.
OpenVLA (Stanford, 2024)
7B params · Open-source · RLDS format
OpenVLA is the most widely adopted open-source VLA. Built on a fused DINOv2 + SigLIP vision backbone and Llama-2 LLM, it was pretrained on approximately 970,000 trajectories from the Open X-Embodiment dataset (the “Magic Soup++” mixture). Actions are tokenized as text and trained alongside vision-language data.
Data requirements: RLDS format. Supports arbitrary dataset mixtures via configurable mixture weights. Fine-tuning on a new task requires as few as 50–200 demonstrations for simple manipulation, though diverse environments improve generalization substantially. LoRA fine-tuning is supported for compute-efficient adaptation. The team experimented with including the DROID dataset but found action token accuracy remained low, suggesting that highly diverse datasets need either larger mixture weights or bigger models.
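Mixture weights like these amount to a weighted sampler over sub-datasets at batch-construction time. A minimal sketch, with made-up dataset names and weights (not the actual Magic Soup++ values):

```python
import random

# Hypothetical mixture weights over OXE sub-datasets. Names and values are
# illustrative only, not the real OpenVLA training mixture.
mixture = {"bridge_v2": 0.4, "rt1": 0.35, "taco_play": 0.25}

def sample_dataset(weights, rng=random.random):
    """Draw a sub-dataset name with probability proportional to its weight."""
    r = rng() * sum(weights.values())
    acc = 0.0
    for name, w in weights.items():
        acc += w
        if r < acc:
            return name
    return name  # fallback for floating-point edge cases

counts = {k: 0 for k in mixture}
for _ in range(10_000):
    counts[sample_dataset(mixture)] += 1
# Over many draws, counts track the mixture weights.
```

In a real pipeline the drawn name selects which dataset the next trajectory is read from, so heavily weighted sub-datasets dominate each training epoch.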
RT-2 (Google DeepMind, 2023)
55B params · Closed-source · Tokenized actions
RT-2 pioneered the concept of casting robot actions as text tokens, training jointly on web-scale vision-language data and robot demonstration data collected from 13 robots over 17 months in an office kitchen environment. This co-training lets the model transfer internet-scale knowledge to physical manipulation.
Key result: RT-2 improved performance on unseen scenarios from 32% (RT-1) to 62%, demonstrating that VLM pretraining transfers directly to manipulation generalization. It also showed emergent reasoning: following instructions like “pick up the object that is not a fruit” without those concepts appearing in the robot training data.
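The actions-as-text idea reduces to uniform binning: clip each continuous action dimension to a range and map it to one of a fixed number of discrete tokens. A sketch assuming 256 bins per dimension and a normalized [-1, 1] action range (the bin count matches what the RT-2 paper reports; the normalization range here is an assumption):

```python
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to a discrete bin index."""
    clipped = np.clip(action, low, high)
    return ((clipped - low) / (high - low) * (n_bins - 1)).round().astype(int)

def detokenize_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover approximate continuous actions from bin indices."""
    return low + tokens / (n_bins - 1) * (high - low)

a = np.array([0.0, 0.5, -1.0, 1.0, 0.25, -0.5, 1.0])
toks = tokenize_action(a)
recovered = detokenize_action(toks)
# Round-trip error is bounded by half a bin width: (high - low) / (n_bins - 1) / 2
assert np.allclose(recovered, a, atol=(2.0 / 255) / 2 + 1e-9)
```

The bin indices are then emitted as ordinary text tokens, which is what lets a VLM vocabulary "speak" robot actions.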
pi-zero (Physical Intelligence, 2024)
~3B params · Closed-source · Flow Matching actions
pi-zero takes a different approach to action generation. Instead of tokenizing actions as text, it uses Conditional Flow Matching via a dedicated Action Expert (~300M parameters) to generate continuous action sequences. The VLM backbone (PaliGemma) handles perception and language, while the Action Expert handles the continuous nature of robot control.
Data requirements: Pi-zero uses block-wise causal attention masks to prevent robotics-specific inputs from overwriting the VLM's pretrained knowledge. This design means the model can use web-scale VLM pretraining while training the Action Expert on more modest volumes of robot demonstration data — but the demonstrations must be high-quality, with precise action labels at sub-second temporal resolution.
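The flow-matching objective can be sketched in a few lines: interpolate between Gaussian noise and a ground-truth action chunk, and train the Action Expert to predict the velocity along that path. This sketch assumes the simple linear interpolant; pi-zero's exact time schedule and loss weighting may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(action_chunk, t, rng):
    """Build one flow-matching training example.

    Returns the noised chunk x_t and the velocity target the action expert
    should regress. Uses the linear path x_t = (1 - t) * noise + t * action,
    whose time derivative is (action - noise).
    """
    noise = rng.standard_normal(action_chunk.shape)
    x_t = (1.0 - t) * noise + t * action_chunk
    velocity_target = action_chunk - noise
    return x_t, velocity_target

chunk = rng.standard_normal((50, 7))  # a 50-step chunk of 7-D actions
x_t, v = flow_matching_pair(chunk, t=0.3, rng=rng)
```

At inference time, integrating the predicted velocity field from pure noise toward t = 1 yields a continuous action chunk with no tokenization step.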
GR00T N1 (NVIDIA, 2025)
2.2B params · Open-weight · Diffusion Transformer
GR00T N1 is NVIDIA's open foundation model for generalist humanoid robots, built with a dual-system design. System 2 is an Eagle-2 VLM that interprets visual scenes and language instructions. System 1 is a Diffusion Transformer that generates smooth motor actions at 120 Hz by denoising action sequences.
Data requirements: GR00T N1 is trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetic data generated through NVIDIA's DreamGen pipeline. It supports cross-embodiment transfer from tabletop arms to dexterous humanoid robots. The synthetic data pipeline uses world foundation models to generate diverse robot trajectory data, reducing the need for expensive real-world collection — but real-world fine-tuning data remains essential for deployment.
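Schematically, diffusion-based action generation starts from Gaussian noise over a whole action sequence and iteratively removes predicted noise. The toy sampler below uses a crude fixed step size and a stand-in noise predictor; real Diffusion Transformer samplers use learned noise schedules and far more careful update rules:

```python
import numpy as np

def denoise_action_sequence(predict_noise, steps=10, horizon=16, dim=7, seed=0):
    """Minimal sketch of diffusion-style action generation.

    Starts from Gaussian noise and repeatedly subtracts a fraction of the
    predicted noise. Schematic only: not a faithful DDPM/DiT sampler.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dim))
    for step in range(steps, 0, -1):
        eps_hat = predict_noise(x, step)  # stand-in for the Diffusion Transformer
        x = x - eps_hat / steps           # crude fixed-size denoising step
    return x

# Dummy "model" that treats the current sample as pure noise, so the
# sequence shrinks toward zero over the denoising loop.
actions = denoise_action_sequence(lambda x, t: x)
```

The key property this illustrates: the whole action horizon is generated jointly, which is what lets the real model emit smooth high-frequency (120 Hz) trajectories.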
VLA Training Data Requirements
Every VLA training pipeline consumes data in three tightly synchronized modalities: visual observations, language instructions, and action trajectories. A fourth consideration, scale, determines how much of each is needed. Understanding these requirements is essential for building datasets that actually improve model performance.
1. Visual Observations
RGB images or video frames from the robot's cameras. Most VLAs use a single egocentric (head or wrist) camera, though GR00T N1 and some research models support multi-view inputs. Resolution typically ranges from 224x224 (OpenVLA default) to 512x512 for detail-sensitive tasks.
The visual data must capture the true deployment distribution: real lighting conditions, real clutter, real surface textures. Models trained on sterile lab environments with uniform backgrounds fail when deployed in kitchens, warehouses, or outdoor settings. This is why egocentric video datasets captured in diverse real-world settings are so valuable for VLA pretraining.
2. Natural Language Instructions
Task descriptions that tell the robot what to do: “pick up the red cup and place it on the shelf,” “fold the towel in half,” “open the drawer.” Instructions must be paired with each demonstration trajectory.
Effective VLA training requires instruction diversity: paraphrases of the same task (“grab the mug” vs. “pick up the cup”), varying levels of specificity (“clean up” vs. “place the dishes in the sink”), and compositional instructions that combine multiple sub-tasks. Many open datasets have thin instruction coverage — a single template per task — which limits the language grounding of the resulting model.
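One cheap first source of paraphrase coverage is templated expansion, which human annotators or LLM paraphrasing can then build on. An illustrative sketch; the phrase tables here are made up:

```python
import itertools

# Illustrative paraphrase tables. Production pipelines would use human
# annotators or reviewed LLM paraphrases rather than hand-written lists.
verbs = ["pick up", "grab", "lift"]
objects = ["the mug", "the cup", "the coffee mug"]
template = "{verb} {obj} and place it on the shelf"

instructions = [
    template.format(verb=v, obj=o)
    for v, o in itertools.product(verbs, objects)
]
# 3 verbs x 3 object phrasings -> 9 distinct instructions for one task
```

Even this simple expansion turns a single template per task into a small instruction distribution, which measurably helps language grounding.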
3. Action Trajectories
The sequence of motor commands the robot should execute, synchronized frame-by-frame with the visual observations. Actions are typically represented as 7-dimensional vectors: 3D end-effector position (x, y, z), 3D orientation (roll, pitch, yaw), and gripper state (open/close). Some models use joint-space actions (6-7 joint angles) instead.
Temporal alignment is critical. Actions must be synchronized with visual observations at the control frequency of the target robot — typically 10-50 Hz for manipulation tasks, up to 120 Hz for GR00T N1's Diffusion Transformer. Misaligned timestamps, even by 50ms, can teach the model incorrect visuomotor correlations.
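A simple sanity check before training is to measure the worst-case gap between paired frame and action timestamps. A sketch assuming index-paired streams of equal length:

```python
def max_misalignment_ms(frame_ts, action_ts):
    """Worst-case gap between each action timestamp and its paired frame,
    in milliseconds. Assumes equal-length, index-paired streams (seconds)."""
    assert len(frame_ts) == len(action_ts)
    return max(abs(f - a) for f, a in zip(frame_ts, action_ts)) * 1000.0

# 20 Hz streams where actions lag their frames by a constant 8 ms.
frames = [i * 0.05 for i in range(100)]
actions = [t + 0.008 for t in frames]
assert max_misalignment_ms(frames, actions) < 50.0  # inside the 50 ms danger zone
```

Episodes that exceed the threshold are better dropped or re-synchronized than trained on.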
4. Scale Requirements
Scale requirements vary dramatically by use case:
- General-purpose pretraining: 500K–1M+ trajectories across many embodiments (OpenVLA used 970K)
- Task-specific fine-tuning: 50–50,000 demonstrations depending on task complexity and diversity
- LoRA fine-tuning: Can be effective with 5,000–50,000 examples at compute costs of $100–$5,000
- Cross-embodiment training: Reduces per-embodiment data needs by up to 90% but requires careful mixture weighting
Open Datasets for VLA Training
The open-source VLA ecosystem has coalesced around a handful of foundational datasets. Here is what is available, what each provides, and where the coverage ends.
| Dataset | Scale | Embodiments | Strengths | Limitations |
|---|---|---|---|---|
| Open X-Embodiment | 1M+ trajectories | 22 robot types | Largest cross-embodiment collection; RLDS format; standard for VLA pretraining | Lab environments only; uneven quality across sub-datasets; limited language instruction diversity |
| BridgeData V2 | 60,096 trajectories | WidowX-250 | 24 environments; diverse objects; well-curated; common fine-tuning benchmark | Single low-cost arm; tabletop only; no mobile manipulation or humanoid data |
| DROID | 76,000 trajectories | Franka Emika | 564 scenes; 86 tasks; high scene diversity for a single-embodiment dataset | So diverse that VLAs struggle to fit it (OpenVLA removed it from final training); Franka-only |
| RH20T | 110,000 episodes | 20 robot types | 147 tasks; strong task diversity; multi-embodiment | Collected primarily in Chinese labs; limited environment diversity outside research settings |
| Ego4D | 3,670 hours video | Human (wearable cameras) | Massive scale; diverse environments; 74 locations across 9 countries | No robot actions; requires retargeting to convert human demonstrations to robot control |
| Ego-Exo4D | Synchronized multi-view | Human (Aria glasses + external) | Paired ego + exo views; rich activity data; 15 university partners | Academic access only; no robot actions; focused on activity recognition rather than manipulation |
Where Open Datasets Fall Short
Open datasets are invaluable for research but consistently fail to meet the requirements of production VLA deployment. Here are the five recurring gaps.
Environment Diversity
Nearly all open robot datasets are collected in university labs with controlled lighting, clean surfaces, and limited object sets. Real deployment environments — homes, warehouses, hospitals, outdoor spaces — have uncontrolled lighting, clutter, reflective and transparent surfaces, and thousands of novel object categories. Models trained on lab data systematically fail when encountering these conditions.
Action Label Granularity
Most open datasets record end-effector poses at 10-30 Hz. Production VLAs operating at 50-120 Hz need higher temporal resolution. Additionally, many datasets lack gripper force data, contact events, and fine-grained manipulation phase labels (approach, pre-grasp, grasp, lift, transport, place) that are critical for dexterous tasks.
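When only 10-30 Hz labels exist, teams sometimes upsample by interpolation, which matches the controller rate but cannot recover true high-frequency dynamics. An illustrative sketch using linear interpolation for smooth pose channels only; discrete channels like gripper open/close need nearest-neighbor handling instead:

```python
import numpy as np

def resample_actions(actions, src_hz, dst_hz):
    """Linearly interpolate an (N, D) action trajectory from src_hz to dst_hz.

    Fine for smooth pose channels; discrete channels (gripper state,
    contact flags) should be resampled with nearest-neighbor instead.
    """
    n = actions.shape[0]
    t_src = np.arange(n) / src_hz
    t_dst = np.arange(int(round((n - 1) * dst_hz / src_hz)) + 1) / dst_hz
    return np.stack(
        [np.interp(t_dst, t_src, actions[:, d]) for d in range(actions.shape[1])],
        axis=1,
    )

traj_30hz = np.random.default_rng(0).standard_normal((90, 7))  # 3 s at 30 Hz
traj_120hz = resample_actions(traj_30hz, 30, 120)
# (90 - 1) * 4 + 1 = 357 samples at 120 Hz
```

Every fourth upsampled step lands exactly on an original sample, so no recorded information is lost; the interpolated steps in between are smooth guesses, not measurements.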
Language Instruction Quality
Open datasets typically use a single template instruction per task (e.g., 'pick up the blue block'). Production VLAs need rich instruction diversity: paraphrases, varying specificity levels, compositional multi-step instructions, and corrections. This language coverage is expensive to annotate and almost entirely missing from open data.
Enrichment Layers
Raw robot demonstrations lack the enrichment layers that accelerate VLA training: per-frame depth maps, semantic segmentation masks, hand-object interaction labels, and 3D scene reconstructions. Teams must build expensive enrichment pipelines in-house or work with providers like Claru that deliver pre-enriched data.
Licensing and Compliance
Many open datasets have restrictive academic licenses (non-commercial use only) or unclear provenance. Ego4D requires a 48-hour license approval. BridgeData V2's OXE version is outdated. Companies building commercial robotics products need data with clear commercial licenses and documented collection consent — something open datasets rarely guarantee.
Using Human Video for VLA Pretraining
A breakthrough research direction in 2025-2026 has been pretraining VLAs on egocentric human video, then fine-tuning on robot data. This dramatically reduces the amount of expensive robot demonstration data needed.
NVIDIA's EgoScale (2025) trained a VLA on over 20,000 hours of action-labeled egocentric human video — more than 20x larger than any prior effort. The results were striking: a log-linear scaling law between human data scale and validation loss, and a 54% improvement in average success rate over a no-pretraining baseline for a 22-DoF robotic hand performing dexterous manipulation.
EgoMimic from the same period showed that co-training on human hand demonstrations alongside robot data — using Project Aria glasses and a bimanual manipulator — consistently outperforms robot-only training. Their key finding: one hour of additional human data is more valuable than one hour of additional robot data.
These results have significant implications for VLA data strategy. Rather than collecting all data through expensive robot teleoperation, teams can:
- Pretrain on large-scale egocentric human video with action labels, depth maps, and pose estimation
- Fine-tune on targeted robot demonstrations collected on the deployment hardware
- Achieve better generalization with significantly less robot data
This is precisely where Claru's 500K+ enriched egocentric video clips become relevant. Each clip comes pre-enriched with the annotation layers VLA pretraining needs: depth maps, human pose estimation, semantic segmentation, and action labels.
Data Enrichment for VLA Pipelines
Raw video — whether from humans or robots — is not sufficient for VLA training. The models need structured annotation layers that provide supervisory signals beyond raw pixels.
Depth Maps
Depth Anything V2
Per-frame monocular depth estimation provides the 3D spatial understanding VLAs need to plan reach and grasp actions. Depth Anything V2, published at NeurIPS 2024, offers models ranging from 25M to 1.3B parameters and has been evaluated as a LiDAR alternative for robotic depth sensing, with 89.1% of near-field errors within 0.5m.
Pose Estimation
ViTPose / ViTPose++
2D and 3D human body and hand joint positions are critical for human-to-robot transfer learning. ViTPose achieves 81.1 AP on COCO with models scaling from 100M to 1B parameters, while ViTPose++ extends to animal and whole-body pose estimation via task-specific MoE heads.
Semantic Segmentation
SAM 3
Object-level and part-level segmentation masks let VLAs understand scene structure and object affordances. SAM 3, published at ICLR 2026, extends promptable segmentation to concept-based prompts (short noun phrases or image exemplars) across images and video, unifying detection, segmentation, and tracking.
Optical Flow
RAFT
Dense motion fields between consecutive frames provide explicit motion information that helps VLAs predict object dynamics and plan interaction trajectories. RAFT remains the standard optical flow backbone, with recent work combining RAFT outputs with SAM segmentation for motion-aware scene understanding.
Action Labels
Human annotation + InternVideo2
Temporal action segmentation marks the boundaries between discrete manipulation phases: approach, pre-grasp, grasp, lift, transport, place. These labels are essential for training VLAs to decompose complex tasks into executable sub-actions. Automated systems like InternVideo2 provide initial labels, refined by expert human annotators.
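The resulting annotations are often stored as simple labeled intervals. An illustrative layout (phase names from the taxonomy above, times made up) with the basic contiguity check an ingestion pipeline should run:

```python
# Illustrative phase annotation for one demonstration (times in seconds).
# The (label, start, end) layout is an assumption, not a standard schema.
phases = [
    ("approach", 0.0, 1.2),
    ("pre-grasp", 1.2, 1.6),
    ("grasp", 1.6, 1.9),
    ("lift", 1.9, 2.4),
    ("transport", 2.4, 4.0),
    ("place", 4.0, 4.8),
]

# Contiguity check: each phase must start exactly where the previous one ends,
# so every frame in the demonstration carries exactly one phase label.
assert all(
    end == nxt_start
    for (_, _, end), (_, nxt_start, _) in zip(phases, phases[1:])
)
```

Interval labels like these let a trainer derive a per-frame phase signal at any control frequency by a simple timestamp lookup.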
How Claru Fills the VLA Data Gaps
Claru is built specifically for the data needs of VLA and physical AI teams. Here is what we deliver that open datasets do not.
Real-World Environment Diversity
500K+ egocentric video clips captured across 100+ cities in kitchens, workshops, warehouses, retail spaces, and outdoor environments. Every clip is real-world, licensed, and consented — not lab footage, not synthetic, not scraped.
Pre-Enriched by Default
Every clip ships with depth maps (Depth Anything V2), pose estimation (ViTPose), semantic segmentation (SAM), optical flow, and AI-generated captions. No enrichment pipeline to build. No months of engineering time before training can start.
Expert Action Annotation
Human annotators label what automated systems miss: temporal action boundaries, object affordances, grasp types, intent labels, and natural language instruction paraphrases. 4M+ completed human annotations and growing.
Your Format, Your Timeline
Data delivered in RLDS, WebDataset, HDF5, Parquet, or custom formats — compatible with OpenVLA, Octo, LeRobot, and proprietary VLA pipelines. Brief to first delivery in days, not months. Direct S3/GCS delivery or API access.
Frequently Asked Questions
What is VLA training data?
VLA training data consists of synchronized triplets of visual observations (RGB images or video frames), natural language instructions (e.g., 'pick up the red cup'), and action trajectories (end-effector poses, joint angles, or gripper states). Each data point teaches a Vision-Language-Action model to map what the robot sees and the instruction it receives to the physical actions it should execute. The data is typically collected through human teleoperation of robots or from egocentric video of humans performing tasks, then converted to the robot's action space.
How much training data does a VLA model need?
The amount depends on the approach. Pretraining a general-purpose VLA like OpenVLA used approximately 970,000 trajectories from the Open X-Embodiment dataset across 22 different robot embodiments. Fine-tuning a pretrained VLA for a specific task can require as few as 50 to 200 demonstrations for simple pick-and-place tasks, or 5,000 to 50,000 demonstrations for complex dexterous manipulation. The key insight from 2025-2026 research is that data diversity matters more than raw volume: 1,000 demonstrations across 50 environments outperform 5,000 demonstrations in a single environment.
What is the Open X-Embodiment dataset?
Open X-Embodiment (OXE) is a collaborative dataset aggregating over one million robot manipulation trajectories from 22 different robot embodiments across 21 research institutions. Created by Google DeepMind and collaborators, it provides the largest open-source collection of cross-embodiment robot data in RLDS format. OpenVLA was pretrained on a curated subset called 'Magic Soup++' containing approximately 970,000 trajectories. OXE includes data from platforms like Franka Emika, UR5, Google Robot, and others, making it the standard pretraining dataset for VLA research.
What are the key open-source VLA datasets available?
The major open-source VLA datasets include Open X-Embodiment (1M+ trajectories, 22 embodiments), BridgeData V2 (60,096 trajectories on a WidowX-250 arm across 24 environments), DROID (76,000 trajectories from 564 scenes on Franka robots), and RH20T (110,000 episodes across 147 tasks from 20 robots). Academic datasets like Ego4D (3,670 hours of egocentric video) and Ego-Exo4D provide human demonstration data that can be used for VLA pretraining. However, all open datasets have significant gaps: limited environment diversity, narrow task coverage, and insufficient action label granularity for production deployment.
What is the difference between VLA and VLM models?
A Vision-Language Model (VLM) like GPT-4o or PaliGemma processes images and text to produce text outputs. It can describe what it sees and answer questions but cannot control a robot. A Vision-Language-Action (VLA) model extends a VLM by adding action generation: it takes the same visual and language inputs but outputs continuous robot actions (joint positions, end-effector poses, gripper commands) that directly control hardware. VLAs typically use a pretrained VLM as the backbone and add an action head, trained on robot demonstration data, that converts the VLM's representations into physical motor commands.
How does Claru provide VLA training data?
Claru provides VLA training data through three channels. First, egocentric video capture: 10,000+ contributors worldwide wear cameras during real-world tasks, producing first-person video that mirrors robot head-camera viewpoints. Each clip is enriched with depth maps, pose estimation, segmentation masks, and action labels. Second, managed teleoperation: Claru coordinates robot demonstration collection on client hardware with trained operators following structured task protocols. Third, custom annotation: Claru's annotators add natural language instruction labels, action boundary annotations, and object affordance labels that turn raw demonstrations into complete VLA training triplets. All data is delivered in RLDS, WebDataset, or custom formats.
What format should VLA training data be in?
The standard format for VLA training data is RLDS (Reinforcement Learning Datasets), which stores episodes as sequences of timesteps containing observations (images, proprioception), actions (joint positions, end-effector deltas), language instructions, and metadata. OpenVLA, Octo, and most open-source VLA frameworks expect RLDS format. Alternative formats include WebDataset (for streaming large-scale training), HDF5 (for dense numerical trajectories), and LeRobot's format (Parquet metadata with video files). Claru delivers in any of these formats and provides conversion utilities for custom pipeline integration.
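Schematically, an RLDS episode is a sequence of per-timestep dictionaries with observation, action, and instruction fields plus first/last markers. The sketch below is illustrative; real RLDS datasets are TFDS-backed, and exact keys vary per dataset:

```python
# Schematic RLDS-style episode. Key names are illustrative; real RLDS
# datasets define TFDS features and key names vary between sub-datasets.
episode = {
    "steps": [
        {
            "observation": {
                "image": "<224x224x3 uint8 frame>",   # placeholder for pixels
                "state": "<7-D proprioception>",      # placeholder for joints
            },
            "action": [0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0],
            "language_instruction": "open the drawer",
            "is_first": i == 0,
            "is_last": i == 9,
        }
        for i in range(10)
    ]
}
```

The episode boundary flags are what let training frameworks reconstruct trajectories from a flat stream of timesteps.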
Can human video be used to train VLA models?
Yes, and this is an active area of research in 2026. NVIDIA's EgoScale showed that pretraining on 20,000+ hours of egocentric human video improved downstream robot task success rates by 54% compared to training from scratch. EgoMimic demonstrated that co-training on human hand demonstrations alongside robot data improves manipulation performance, and that one hour of additional human data is more valuable than one hour of additional robot data. The key challenge is the embodiment gap: human hands have different kinematics than robot grippers, so the models must learn to transfer affordance and intent understanding rather than directly copying joint trajectories.
Related Resources
Ready to Build Your VLA Training Dataset?
Tell us what your model needs to learn. We'll scope the dataset, define collection and enrichment protocols, and deliver training-ready VLA data in your format.