Last updated: April 2026
How Much Training Data Does a VLA Model Need? (2026)
"How much data do we need?" is the first question every robotics team asks before commissioning a data collection effort. The answer is not a single number — it depends on whether you are pre-training or fine-tuning, and on the task complexity. Here are the actual figures from published research.
TL;DR
- OpenVLA pre-training required 970K robot manipulation trajectories from the Open X-Embodiment dataset — this is the reference scale for training a generalist VLA from scratch.
- Fine-tuning a pretrained VLA on simple pick-and-place typically needs 50–200 demonstrations; multi-step manipulation tasks need 1K–5K; dexterous bimanual tasks need 5K–50K.
- DROID (76K trajectories, 564 environments) and BridgeData V2 (~60K demonstrations) are the most environment-diverse public datasets for fine-tuning on novel real-world settings.
- Human egocentric video without action labels can supplement robot trajectory data for visual pretraining, but the action head still requires physically collected robot demonstrations.
Pre-training vs. Fine-tuning: Different Data Regimes
The question "how much data does a VLA need?" has two very different answers depending on which training phase you are asking about. Pre-training and fine-tuning operate in different data regimes, with different volume requirements and different data characteristics.
Pre-training refers to training the model's general robot manipulation capabilities before task-specific adaptation. This phase requires breadth: diverse robot embodiments, diverse environments, diverse task types, and large trajectory counts. OpenVLA's pre-training on 970K trajectories from Open X-Embodiment is the reference point for a 7B parameter model. Octo, at just 93M parameters, also pre-trains on an Open X-Embodiment mix (roughly 800K trajectories) and achieves good generalization at a fraction of the compute cost.
Fine-tuning adapts a pretrained VLA to a specific robot, environment, or task type. This phase requires depth rather than breadth: high-quality demonstrations of exactly the tasks you need the robot to perform, in the environments where it will be deployed, on the robot hardware you are using. Fine-tuning is dramatically more data-efficient than pre-training — the pretrained model already understands objects, spatial relationships, and manipulation primitives. You are only teaching it the specifics of your deployment context.
Most robotics teams in 2026 do not pre-train VLAs from scratch. They start from OpenVLA, Octo, or a proprietary pretrained VLA, then fine-tune on their own collected data. The pre-training datasets (Open X-Embodiment, BridgeData V2, DROID) are therefore shared infrastructure rather than something each team needs to independently collect.
Pre-training Datasets: Open X-Embodiment, DROID, BridgeData V2
Open X-Embodiment (1M+ trajectories)
Open X-Embodiment is the largest publicly available robot manipulation dataset. It aggregates trajectories from 22 different robot embodiments — including RT-1 robots (Google), Franka Emika Panda arms (multiple institutions), WidowX arms (Berkeley), and others — across 21 research institutions. The total dataset contains over 1 million trajectories. Each trajectory includes synchronized RGB observations (wrist and/or overhead camera), natural language instruction annotations, and action sequences recorded at 1–10 Hz depending on the data source.
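As a concrete sketch of what one such trajectory record looks like, here is a minimal Python schema. The field names and shapes are illustrative, not the dataset's actual serialization format (Open X-Embodiment ships in RLDS episode format):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Trajectory:
    """Illustrative schema for one manipulation trajectory (field names are hypothetical)."""
    instruction: str        # natural language task annotation
    rgb_frames: np.ndarray  # (T, H, W, 3) wrist and/or overhead camera observations
    actions: np.ndarray     # (T, action_dim), e.g. end-effector deltas + gripper command
    control_hz: float       # recording rate, 1-10 Hz depending on the data source


# A toy 50-step episode recorded at 5 Hz
traj = Trajectory(
    instruction="pick up the red cup",
    rgb_frames=np.zeros((50, 224, 224, 3), dtype=np.uint8),
    actions=np.zeros((50, 7), dtype=np.float32),
    control_hz=5.0,
)
print(traj.rgb_frames.shape[0] == traj.actions.shape[0])  # observations and actions stay synchronized
```

The key invariant the sketch encodes is synchronization: every observation frame pairs with an action, which is what makes these trajectories usable as imitation learning supervision.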
OpenVLA used a curated 970K-trajectory subset of Open X-Embodiment, filtering for data quality and instruction coverage. The curation step matters: not all trajectories in the full dataset are equal in quality, and naively training on the full set can introduce noise from poorly annotated or corrupted demonstrations.
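To make the curation step concrete, here is a toy quality filter in the spirit of what such a pipeline checks. The criteria and field names are illustrative assumptions, not OpenVLA's actual curation logic:

```python
def keep_trajectory(traj: dict) -> bool:
    """Toy quality filter. Criteria and field names are hypothetical,
    not OpenVLA's actual curation pipeline."""
    instruction = traj.get("instruction", "").strip()
    actions = traj.get("actions", [])
    frames = traj.get("frames", [])
    return (
        len(instruction) > 0          # drop demos with missing annotations
        and len(actions) >= 5         # drop truncated or corrupted episodes
        and len(actions) == len(frames)  # drop demos with broken obs/action sync
    )


raw = [
    {"instruction": "stack the blocks", "actions": [0] * 40, "frames": [0] * 40},
    {"instruction": "", "actions": [0] * 40, "frames": [0] * 40},          # unannotated
    {"instruction": "pour water", "actions": [0] * 3, "frames": [0] * 3},  # truncated
]
curated = [t for t in raw if keep_trajectory(t)]
print(len(curated))  # 1
```

Even simple filters like these meaningfully change the training distribution at the million-trajectory scale, which is why curated subsets outperform naive use of the full aggregate.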
DROID (76K trajectories, 564 environments)
DROID (Distributed Robot Interaction Dataset) prioritizes environment diversity over embodiment diversity. All 76K trajectories use a standardized Franka Emika Panda arm with consistent wrist and overhead cameras. The value of DROID is the 564 distinct physical environments: real kitchens, offices, labs, workshops, and storage spaces — each with different lighting, backgrounds, object arrangements, and clutter levels. This makes DROID particularly useful for training models that need to generalize to novel deployment environments, even if those environments are not in the training set.
BridgeData V2 (~60K demonstrations)
BridgeData V2 from UC Berkeley contains approximately 60,000 demonstrations of tabletop manipulation tasks performed by a WidowX robot arm. Unlike Open X-Embodiment (which aggregates many sources), BridgeData V2 was collected with a consistent protocol, making the data distribution tighter. It is widely used for fine-tuning experiments because of its clean data collection methodology. Tasks include pick-and-place, stacking, pouring, and various kitchen manipulation scenarios. The dataset is annotated with natural language instructions and task labels.
Data Volume by Task Type
Ranges derived from published results across OpenVLA, Octo, pi-zero, ALOHA, RoboAgent, and GR00T N1 papers. Numbers assume fine-tuning from a pretrained VLA, not training from scratch.
| Task Type | Demo Volume Range |
|---|---|
| VLA pre-training (from scratch) | 500K – 1M+ trajectories |
| Fine-tune: single pick-and-place (fixed object, fixed env) | 50 – 200 demos |
| Fine-tune: pick-and-place (multiple objects + positions) | 200 – 1,000 demos |
| Fine-tune: multi-step manipulation (stack, sort, pour) | 1,000 – 5,000 demos |
| Dexterous bimanual manipulation (fold, assemble, pack) | 5,000 – 50,000 demos |
| Mobile manipulation (navigation + manipulation) | 10,000 – 100,000 demos |
| Humanoid whole-body control | 50,000 – 500,000 demos |
Note: Numbers represent demonstration requirements for achieving meaningful task success rates (>50%) in controlled lab settings. Production deployment in unstructured environments typically requires 10–100× more data for reliable generalization.
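The table and the production multiplier can be encoded as a simple budget estimator. The task keys are shorthand for the table rows above; the 10x default multiplier is the low end of the 10–100x production rule of thumb:

```python
# Fine-tuning demo-count ranges from the table above (lab settings, >50% success)
LAB_DEMO_RANGES = {
    "single_pick_place": (50, 200),
    "multi_object_pick_place": (200, 1_000),
    "multi_step": (1_000, 5_000),
    "dexterous_bimanual": (5_000, 50_000),
    "mobile_manipulation": (10_000, 100_000),
    "humanoid_whole_body": (50_000, 500_000),
}


def demo_budget(task: str, production: bool = False, multiplier: int = 10) -> tuple[int, int]:
    """Return a (low, high) demonstration-count estimate for a task type.
    `multiplier` reflects the 10-100x production rule of thumb noted above."""
    low, high = LAB_DEMO_RANGES[task]
    if production:
        return low * multiplier, high * multiplier
    return low, high


print(demo_budget("multi_step"))                   # (1000, 5000)
print(demo_budget("multi_step", production=True))  # (10000, 50000)
```

A lookup like this is only a scoping aid; the sections below explain why the right point within each range depends on task variation, contact complexity, and distribution shift.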
What Actually Drives Data Volume Requirements
The data volume ranges in the table above are not arbitrary — they reflect specific properties of the learning problem. Understanding what drives volume requirements helps teams scope collection efforts more accurately.
1. Task variation
How many distinct object types, positions, orientations, and lighting conditions does the task require? A model that needs to pick up any cup from any position in any kitchen needs orders of magnitude more data than one that picks up a specific cup from a fixed position in a fixed lab. Task variation is the most significant driver of data volume.
2. Contact complexity
Tasks that require precise contact control — folding fabric, inserting connectors, screwing caps — have much narrower error tolerances than simple pick-and-place. More demonstrations are needed to cover the distribution of contacts the model might encounter. This is why pi-zero's dexterous manipulation tasks (laundry folding, dish loading) required far more data than the manipulation tasks in BridgeData V2.
3. Distribution shift from pretraining
If your target task and environment are well-represented in the pretraining dataset, you need far fewer fine-tuning demonstrations. If your robot, environment, or task type is substantially different from the pretraining distribution — unusual lighting, non-standard objects, different camera setup — you will need more demonstrations to compensate for the distribution shift.

4. Demonstration quality
Clean, consistent demonstrations collected with a systematic protocol are worth more than the same number of noisy demonstrations collected opportunistically. BridgeData V2's tighter data collection protocol is part of why it has been used for so many published fine-tuning experiments — the data quality is reliable enough to isolate algorithmic variables.
Supplementing with Egocentric Human Video
Robot teleoperation is expensive and slow. Collecting 50,000 demonstrations of dexterous bimanual manipulation requires trained operators, dedicated robot hardware, and weeks of collection time. One approach that has shown promise is supplementing robot trajectory data with human egocentric video — first-person footage captured by humans performing the same tasks without robot hardware.
Human egocentric video does not contain robot action labels (joint angles, end-effector poses). What it contains is rich visual information about how humans manipulate objects: grasp types, hand-object contact patterns, approach trajectories, and task sequencing. This visual information transfers to robot learning through two mechanisms:
- Visual pretraining: Using egocentric video to pretrain the perception backbone before action head training. The model learns object representations and visual manipulation patterns from video, then the action head trains on the smaller robot teleoperation dataset.
- Co-training: Mixing egocentric video (with video-level labels or captions, but without robot actions) alongside robot trajectory data during VLA fine-tuning. EgoMimic demonstrated that this co-training consistently improves task success rates compared to robot-only training.
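The co-training mechanism can be sketched as a batch sampler that mixes the two data sources, routing each sample to the losses it can supervise. The 50/50 mixing ratio and the routing labels are illustrative assumptions, not a published recipe:

```python
import random


def cotraining_batch(robot_data, ego_data, batch_size=8, ego_fraction=0.5, rng=None):
    """Sample a mixed batch: robot trajectories supervise the action head,
    egocentric clips (no robot actions) supervise only visual/language losses.
    The 50/50 default mix is illustrative, not a published recipe."""
    rng = rng or random.Random(0)
    n_ego = int(batch_size * ego_fraction)
    batch = [(rng.choice(ego_data), "video_only") for _ in range(n_ego)]
    batch += [(rng.choice(robot_data), "with_actions") for _ in range(batch_size - n_ego)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch


robot = [f"robot_traj_{i}" for i in range(100)]
ego = [f"ego_clip_{i}" for i in range(500)]
batch = cotraining_batch(robot, ego, batch_size=8, ego_fraction=0.5)
print(sum(1 for _, kind in batch if kind == "video_only"))  # 4
```

In a real training loop, the `"video_only"` samples would skip the action prediction loss and contribute only to the visual and language objectives, which is what lets unlabeled human video reduce the robot teleoperation budget.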
Claru's 500K+ egocentric clips — captured across 100+ cities, covering kitchen, workshop, warehouse, and outdoor manipulation scenarios — are specifically structured for this use case. Each clip includes depth maps, pose estimation, semantic segmentation, and action boundary labels that make them compatible with visual pretraining pipelines for VLA development.
The practical takeaway: if your team is facing a data volume problem for a manipulation task, before scaling up expensive robot teleoperation, assess whether egocentric human video of the same tasks could substitute for some of that teleoperation data in the visual pretraining phase. For tasks involving everyday objects and environments, the answer is usually yes.
Key Takeaways
- Pre-training a generalist VLA requires hundreds of thousands of trajectories — OpenVLA's 970K-trajectory Open X-Embodiment subset is the current reference scale for 7B parameter models.
- Fine-tuning from a pretrained VLA is dramatically more data-efficient: 50–200 demos for simple pick-and-place, 1K–5K for multi-step manipulation, 5K–50K for dexterous bimanual tasks.
- DROID (76K trajectories, 564 environments) is the best public option for fine-tuning on novel real-world environments; BridgeData V2 (~60K demos) is the cleanest single-robot dataset for methodology experiments.
- Data volume requirements are driven primarily by task variation, contact complexity, distribution shift from pretraining, and demonstration quality — not trajectory count alone.
- Human egocentric video can substitute for robot teleoperation data in the visual pretraining phase, reducing the total robot teleoperation budget needed for a new task.
- Production deployment requirements are typically 10–100× higher than the numbers reported in controlled lab benchmarks — account for long-tail scenarios, edge cases, and distributional shift in real environments.
Frequently Asked Questions
How many robot demonstrations do I need to train a VLA?
It depends on whether you are pre-training or fine-tuning, and on task complexity. Pre-training a VLA from scratch requires hundreds of thousands to millions of trajectories — OpenVLA used 970K trajectories from Open X-Embodiment. Fine-tuning a pretrained VLA on a specific task is far more data-efficient: simple pick-and-place in a single environment requires 50–200 demonstrations; complex manipulation with multiple objects and configurations needs 500–5,000 demonstrations; dexterous bimanual tasks like folding or assembly may need 5,000–50,000 demonstrations. These numbers assume the fine-tuning task is semantically covered by the base model's training distribution.
Can I fine-tune OpenVLA with 100 demos?
Yes, for simple tasks. The OpenVLA paper reports successful fine-tuning on single-object pick-and-place with 100–200 demonstrations when the task environment is not drastically different from the Open X-Embodiment distribution. Performance degrades for tasks with unusual object shapes, non-standard lighting, or manipulation types (like pouring or folding) that are underrepresented in Open X-Embodiment. For tasks outside OpenVLA's training distribution, 100 demos will typically be insufficient, and you will need either more demonstrations or a pretrained VLA that better covers your task domain.
What is the minimum dataset size for robot learning?
The minimum viable dataset size is task-dependent, not a fixed number. For imitation learning on a single primitive task (reach and grasp a specific object in a fixed position), as few as 10–20 demonstrations can achieve measurable success rates in controlled conditions. For generalizing across object positions, this rises to 50–200. For generalizing across object types and environments, 500–5,000 demonstrations is the practical floor. Models that need to generalize across diverse environments, multiple tasks, and varied lighting and clutter conditions require tens to hundreds of thousands of trajectories — which is why Open X-Embodiment, BridgeData V2, and DROID exist as shared community resources.
What datasets were used to train OpenVLA?
OpenVLA was pre-trained on a curated subset of the Open X-Embodiment dataset: 970,000 robot manipulation trajectories across 22 different robot embodiments from 21 research institutions. The dataset includes contributions from RT-1 (Google), BridgeData V2 (UC Berkeley), TACO-Play (Karlsruhe Institute of Technology), Language Table (Google), and 17 other robotics research groups. The training data covers tabletop manipulation, mobile manipulation, and some dexterous manipulation, with natural language instruction annotations for each trajectory.
How does DROID differ from Open X-Embodiment for VLA training?
DROID (Distributed Robot Interaction Dataset) provides 76,000 trajectories collected across 564 distinct environments by 50+ data collectors using a standardized Franka Emika Panda robot arm. Unlike Open X-Embodiment, which aggregates data from many different robot types and collection setups, DROID prioritizes environment diversity over robot diversity. The 564 environments include diverse real-world locations — kitchens, offices, labs, workshops — which makes DROID particularly useful for training models that need to generalize to novel environments rather than novel robot embodiments. DROID also uses standardized wrist and overhead cameras, making the visual distribution more consistent across trajectories.
Related Resources
VLA Training Data: The Complete Guide
Architecture overview, open datasets, data gaps, and how teams source VLA training data.
VLA Training Data — Claru
How Claru collects, enriches, and delivers data for VLA model development.
Glossary: Manipulation Trajectory
Definition of manipulation trajectories and their role in VLA training.