
Physical AI Training Data Guide 2026


Google DeepMind's RT-2 required 130K real-world robot episodes to generalize across 700+ manipulation instructions — this guide breaks down exact data specs, collection pipelines, and quality criteria by robot type.

TL;DR

  • Google DeepMind's Open X-Embodiment dataset aggregates over 1M real robot episodes across 22 robot types, but fewer than 5% of contributing datasets include force/torque annotations. Dexterous manipulation training suffers directly.
  • The Waymo Open Dataset contains 1,150 driving scenes with 12M 3D bounding box labels. Getting comparable annotation density for indoor manipulation would cost roughly 40–80× more per frame because of occlusion and deformable-object complexity.
  • Mandlekar et al. (arXiv:1907.02664) showed that filtering human teleoperation data to the top 25% of demonstrations by a multi-attribute quality score improved imitation learning success rates by 30–50% on contact-rich tasks. Raw episode count is not enough.
  • Physical AI datasets degrade through five distinct failure modes (sensor dropout, kinematic aliasing, label drift, distribution collapse, sim-real texture mismatch). This post gives a taxonomy with detection heuristics for each.


Data requirements by robot domain

Physical AI training data is not fungible across robot types. A highway driving dataset is structurally useless for bimanual manipulation because the action spaces, sensor suites, temporal resolutions, and failure distributions differ fundamentally between domains. This section gives the minimum viable data specs you need when scoping a new physical AI training run for any robot embodiment.

Manipulation (6/7-DoF arms, dexterous hands)

Manipulation policies trained via behavior cloning or diffusion-based imitation learning (e.g., Diffusion Policy) need dense, contact-rich demonstrations. Google DeepMind's RT-2 trained on approximately 130K real episodes to generalize across 700+ language-conditioned instructions, as reported in Brohan et al. (arXiv:2307.15818). The DROID dataset collected 76K episodes across 564 scenes and 86 tasks, using bimanual Franka arms with wrist-mounted cameras, joint states at 15 Hz, and RGB at 30 Hz.

What you need:

  • Action space: Joint positions or end-effector poses at ≥15 Hz. For dexterous hands like the Allegro Hand (16 DoF), 20–50 Hz is typical.

  • Visual inputs: At minimum, one third-person RGB camera + one wrist camera. Stereo or depth is required for transparent/reflective object handling.

  • Proprioception: Joint angles, velocities, and gripper force/torque. Fewer than 5% of Open X-Embodiment constituent datasets include F/T signals, according to the Open X-Embodiment Collaboration (arXiv:2310.08864).

  • Episode volume: 1K–10K demonstrations per task for behavior cloning; 50K+ for multi-task generalization. Scale depends on data quality (see quality filtering below).

  • Annotations: Task success/failure labels, grasp-type tags, contact event timestamps, and object 6-DoF poses when available.
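These requirements can be encoded as a lightweight schema check at ingestion time. The sketch below is illustrative — the field names and the `meets_minimum_spec` thresholds mirror the bullets above but are not a published standard:

```python
from dataclasses import dataclass, field

@dataclass
class ManipulationEpisode:
    """Minimal per-episode record for a 6-DoF arm dataset (illustrative fields)."""
    joint_positions_hz: int   # >= 15 for arms; 20-50 typical for dexterous hands
    rgb_hz: int               # >= 30 typical
    cameras: list             # e.g. ["third_person", "wrist"]
    has_depth: bool           # needed for transparent/reflective objects
    has_force_torque: bool    # rare: <5% of Open X-Embodiment datasets
    success: bool             # binary task outcome label
    annotations: dict = field(default_factory=dict)  # grasp type, contact events, 6-DoF poses

    def meets_minimum_spec(self) -> bool:
        """Floor spec from the list above: rates plus wrist + third-person views."""
        return (self.joint_positions_hz >= 15
                and self.rgb_hz >= 30
                and "wrist" in self.cameras
                and len(self.cameras) >= 2)
```

Running this check at collection time, rather than at training time, is what keeps out-of-spec episodes from silently diluting the dataset.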

Locomotion (quadrupeds, humanoids)

Locomotion policies for platforms like Boston Dynamics Spot or humanoid robots from Figure AI and Agility Robotics are mostly trained in simulation (NVIDIA Isaac Gym, MuJoCo) and transferred via sim-to-real methods. Real-world data plays a different role here: it calibrates dynamics models and provides terrain/surface priors for sim-to-real transfer.

  • Sensor suite: IMU (≥200 Hz), joint encoders (≥100 Hz), foot contact binary signals, and optionally LiDAR or depth cameras for terrain mapping.

  • Real data volume: Typically 10–100 hours of logged locomotion for system identification and domain adaptation — orders of magnitude less than manipulation needs.

  • Annotations that matter: Terrain labels (concrete, gravel, grass, stairs), fall/recovery events, and slip detection ground truth.

Autonomous vehicles

Autonomous vehicles represent the most data-mature domain in physical AI. The Waymo Open Dataset provides 1,150 scenes, each 20 seconds, with 12.6M 3D bounding boxes, 11.8M 2D bounding boxes, and LiDAR point clouds at 10 Hz, as documented in Sun et al. (arXiv:1912.04838). The nuScenes dataset from Motional offers 1,000 scenes with full 360° camera coverage, radar, and LiDAR.

  • Sensor suite: 5–8 cameras (surround view), 1–5 LiDARs (64–128 beam), radar, GPS/INS at ≥10 Hz.

  • Data volume: Waymo reportedly logs approximately 25 million miles per year of real driving data. Filtering to the training-relevant long tail (near-misses, unusual agents, adverse weather) is the main engineering problem.

  • Annotations: 3D bounding boxes, lane segmentation, HD map alignment, trajectory prediction ground truth, and scenario-level behavior tags.


| Domain | Min sensor Hz | Typical episode count | Biggest missing modality | Real data role |
| --- | --- | --- | --- | --- |
| Manipulation (6-DoF arm) | 15–30 Hz | 10K–130K+ | Force/torque | Primary training signal |
| Manipulation (dexterous hand) | 20–50 Hz | 5K–50K | Tactile | Primary + sim augmentation |
| Quadruped locomotion | 100–200 Hz | 10–100 hrs | Terrain labels | Sim-to-real calibration |
| Humanoid locomotion | 100–500 Hz | 10–50 hrs | Full-body contact | Sim-to-real calibration |
| Autonomous vehicles | 10–20 Hz | 1M+ miles | Rare-event scenarios | Primary training signal |

Manipulation has the worst data efficiency of any physical AI domain: the useful training signal per dollar of data collected is roughly 40–80× worse than for AV data, because manipulation involves higher occlusion rates, deformable objects, and contact events that cameras alone cannot capture.

Modality specs and sensor coverage

Sensor modality selection determines training data quality more than volume or resolution. Chen et al. (arXiv:2510.21391) surveyed the physical AI stack and found that the gap between simulation visual fidelity and real-world RGB is the single largest contributor to sim-to-real transfer failures, ahead of dynamics mismatch. This finding changes how you should prioritize data collection spend.

RGB video is the universal modality. Every major VLA architecture consumes it. The minimum spec for manipulation is 640×480 at 30 Hz with <5ms synchronization to action labels. For AV, 1920×1080 or higher at 10–20 Hz is standard.

Depth matters most for transparent/reflective objects and precision placement. Intel RealSense D435 or equivalent at ≥15 Hz is the floor. Structured light sensors fail outdoors; time-of-flight sensors fail at close range (<15cm). Choose based on your deployment environment.

Proprioception (joint states, velocities) is underrated as a training modality. The Open X-Embodiment Collaboration (arXiv:2310.08864) showed that policies trained with proprioceptive inputs generalize across robot morphologies 15–20% better than vision-only policies.

Tactile sensing is the biggest missing modality in physical AI training data. Fewer than 2% of publicly available manipulation datasets include any form of tactile data. DIGIT and GelSight sensors exist but produce high-bandwidth, hard-to-annotate signals. Whoever solves tactile data at scale will unlock a step change in dexterous manipulation.

Egocentric video (head-mounted or wrist-mounted camera footage capturing the operator's perspective) is increasingly used for manipulation pretraining. It preserves the viewpoint and hand-object spatial relationships that third-person video loses.

The most impactful investment for multi-modal training data is camera-action synchronization accuracy, not sensor resolution. A perfectly synchronized 480p dataset produces better behavior cloning policies than a poorly synchronized 4K dataset because timing errors of 10–50ms corrupt the action labels that policies learn from.
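The synchronization claim is auditable directly from logged timestamps. A minimal sketch, assuming timestamps in seconds and a hypothetical 10 ms gate:

```python
import bisect

def sync_offsets_ms(camera_ts, action_ts):
    """For each camera timestamp, distance (ms) to the nearest action timestamp.

    Both inputs are sorted lists of timestamps in seconds.
    """
    offsets = []
    for t in camera_ts:
        i = bisect.bisect_left(action_ts, t)
        candidates = []
        if i < len(action_ts):
            candidates.append(abs(action_ts[i] - t))
        if i > 0:
            candidates.append(abs(action_ts[i - 1] - t))
        offsets.append(min(candidates) * 1000.0)
    return offsets

def passes_sync_gate(camera_ts, action_ts, max_jitter_ms=10.0):
    """Flag episodes whose worst camera/action misalignment exceeds the gate."""
    return max(sync_offsets_ms(camera_ts, action_ts)) <= max_jitter_ms
```

Run this per episode at ingestion; an episode that fails the gate is cheaper to discard than to let it corrupt action labels during training.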

Collection-to-annotation pipeline benchmarks

Physical AI data collection costs are dominated by annotation and quality control, not raw episode capture. Here are realistic collection-to-annotation timelines based on published benchmarks and industry reports.

Teleoperation collection (manipulation): A skilled operator using a SpaceMouse or bilateral teleoperation rig generates roughly 100–200 usable episodes per 8-hour shift for tabletop pick-and-place tasks. For contact-rich tasks (e.g., inserting a USB cable), throughput drops to 30–60 episodes per shift. The DROID project averaged approximately 120 episodes per operator per day across distributed collection sites.
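Those throughput numbers translate into collection timelines with simple arithmetic. A back-of-envelope estimator (the 0.85 usable-fraction default is an assumption, not a published figure):

```python
import math

def collection_shifts_needed(target_episodes, episodes_per_shift, usable_fraction=0.85):
    """Rough number of 8-hour teleoperation shifts to reach a target episode count.

    usable_fraction discounts episodes later rejected by the quality gate.
    """
    usable_per_shift = episodes_per_shift * usable_fraction
    return math.ceil(target_episodes / usable_per_shift)
```

At DROID-like throughput (roughly 120 usable episodes per operator-day), a 50K-episode target implies on the order of 400 operator-days before any quality filtering.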

Annotation throughput: 3D bounding box annotation in LiDAR point clouds (AV domain) takes 5–15 minutes per frame at production quality, according to Scale AI benchmarks. Manipulation task-success labeling is faster (approximately 2 seconds per episode for binary success/failure), but semantic annotations (grasp type, contact events, failure reason) take 1–3 minutes per episode.

Quality verification: Mandlekar et al. (arXiv:1907.02664) introduced a multi-attribute quality scoring system for human demonstrations in their RoboTurk platform. They scored demonstrations on task completion, trajectory smoothness, and efficiency. Filtering to the top 25% improved downstream imitation learning success rates by 30–50% on contact-rich tasks in simulation. This is the strongest published evidence that data curation dominates data volume for manipulation training.
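In that spirit, quality-gated filtering reduces to per-episode scoring plus a quantile cut. The attribute names and weights below are illustrative stand-ins, not the published RoboTurk rubric:

```python
def quality_score(episode, weights=None):
    """Weighted multi-attribute score for one episode (attributes and weights
    here are illustrative assumptions)."""
    weights = weights or {"completed": 0.5, "smoothness": 0.3, "efficiency": 0.2}
    return sum(weights[k] * episode[k] for k in weights)

def filter_top_fraction(episodes, fraction=0.25):
    """Keep the top `fraction` of episodes ranked by quality score."""
    ranked = sorted(episodes, key=quality_score, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

The point of the quantile cut (rather than a fixed score threshold) is that it adapts as operator skill and task difficulty drift over a collection campaign.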

The implication for pipeline design: you need a quality gate built into collection, not bolted on after. Claru's data enrichment pipeline addresses this by annotating each collected manipulation episode with 18+ metadata fields — including grasp taxonomy labels, object material properties, lighting condition tags, and contact event timestamps — structured for filtering and quality scoring before training begins. As the Mandlekar et al. RoboTurk results demonstrate, a well-curated 25K-episode dataset can outperform an unfiltered 100K-episode dataset.

End-to-end timeline: For a new manipulation task, expect 2–4 weeks from environment setup to first usable training batch (assuming existing robot infrastructure). AV data pipelines are more mature: Motional and Cruise have reported 48-hour turnaround from drive to labeled training data for known annotation schemas.

The bottleneck in physical AI data pipelines is never raw collection. It is annotation quality control and edge-case coverage.

Failure mode taxonomy

Physical AI training data degrades through five distinct, measurable failure modes. Each has a specific detection heuristic and an acceptable threshold.

1. Sensor dropout is the intermittent loss of camera frames, depth readings, or force signals during data collection. Detection: check for >2% NaN or zero-value entries in any sensor channel per episode. Cause: USB bandwidth saturation or ROS message queue overflow. Impact: policies learn discontinuous state representations that produce jerky or unsafe behaviors at deployment.
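A minimal sketch of the dropout heuristic, applied per channel. Note that exact zeros are legitimate for some channels (a joint at its zero position, for instance), so the zero check should be tuned to your sensors:

```python
import math

def dropout_fraction(channel):
    """Fraction of NaN or exact-zero samples in one sensor channel."""
    bad = sum(1 for v in channel
              if (isinstance(v, float) and math.isnan(v)) or v == 0)
    return bad / len(channel)

def flag_dropout(episode_channels, threshold=0.02):
    """Return the channel names exceeding the 2% NaN/zero heuristic."""
    return [name for name, ch in episode_channels.items()
            if dropout_fraction(ch) > threshold]
```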

2. Kinematic aliasing occurs when multiple distinct physical states map to identical observation vectors. A gripper holding an object versus pressing against a surface produces the same joint-angle snapshot without force/torque data. Detection: cluster observation-action pairs and flag clusters with bimodal action distributions. Impact: behavior cloning learns averaged (and therefore incorrect) actions at aliased states.
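A simplified version of that detection heuristic, using coarse observation binning in place of real clustering and a max-gap ratio in place of a statistical dip test — both simplifications are assumptions of this sketch:

```python
from collections import defaultdict

def flag_aliased_bins(obs_action_pairs, obs_round=1, gap_ratio=0.5):
    """Group scalar observations into coarse bins, then flag bins whose action
    values split into two well-separated modes.

    A bin is treated as 'bimodal' when the largest gap between sorted actions
    exceeds gap_ratio of the bin's total action range.
    """
    bins = defaultdict(list)
    for obs, act in obs_action_pairs:
        bins[round(obs, obs_round)].append(act)
    flagged = []
    for key, actions in bins.items():
        if len(actions) < 4:
            continue  # too few samples to judge bimodality
        acts = sorted(actions)
        rng = acts[-1] - acts[0]
        if rng == 0:
            continue
        max_gap = max(b - a for a, b in zip(acts, acts[1:]))
        if max_gap / rng > gap_ratio:
            flagged.append(key)
    return flagged
```

A production pipeline would cluster full observation vectors (e.g., k-means on proprioceptive state) and apply a proper multimodality test, but the shape of the check is the same.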

3. Label drift is the gradual shift of annotation semantics across labelers or time. Mandlekar et al. (arXiv:1907.02664) observed inter-annotator disagreement rates of 12–18% on task success labels for multi-step manipulation tasks. Detection: compute pairwise annotator agreement (Cohen's kappa) per task and flag if κ < 0.8. Impact: noisy labels degrade policy performance proportionally to the disagreement rate.
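Cohen's kappa itself is a few lines; the κ ≥ 0.8 gate can then run per task and per annotation type:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of each annotator's marginal frequency.
    pe = sum((ca[c] / n) * (cb[c] / n) for c in ca.keys() | cb.keys())
    if pe == 1.0:
        return 1.0  # both annotators constant and identical
    return (po - pe) / (1 - pe)
```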

4. Distribution collapse is the over-representation of easy configurations in collected data. Pinto and Gupta (arXiv:2005.07866) found that real-world grasp datasets systematically under-represent thin, flat, and deformable objects by 3–5× relative to household object frequency distributions. Detection: compute entropy over object categories, pose bins, and scene configurations, then compare to your target distribution. Impact: policies fail silently on under-represented object classes.
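For the entropy check, normalized Shannon entropy over observed categories is the simplest version; this sketch assumes a uniform target distribution, and comparing against a non-uniform target would use KL divergence instead:

```python
import math
from collections import Counter

def normalized_entropy(categories):
    """Shannon entropy of the empirical category distribution, divided by the
    maximum possible entropy log(K) so the result lies in [0, 1]."""
    counts = Counter(categories)
    n = len(categories)
    k = len(counts)
    if k <= 1:
        return 0.0  # a single category carries no diversity
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)
```

Against the ≥0.7 threshold in the table below, a heavily skewed collection (one object class dominating) fails immediately, which is exactly the silent-failure case you want to catch during collection.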

5. Sim-real texture mismatch is the visual appearance gap between synthetic training data and real-world deployment environments. Chen et al. (arXiv:2510.21391) found that this visual gap (as distinct from dynamics mismatch) accounts for 40–60% of sim-to-real transfer failure in manipulation. Detection: train a domain classifier on real versus simulated images — if classification accuracy exceeds 90%, your simulated data is not visually close enough to real-world conditions.
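The published heuristic trains a real image classifier; the stand-in below uses a nearest-centroid classifier over precomputed feature vectors just to show the accuracy-gate logic. Feature extraction and a proper held-out split are omitted here and would be required in practice:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def domain_gap_accuracy(real_feats, sim_feats):
    """Classify each feature vector by nearest domain centroid and report
    accuracy. High accuracy (e.g. > 0.90) means the domains are easy to tell
    apart, i.e. the visual sim-real gap is too large."""
    c_real, c_sim = centroid(real_feats), centroid(sim_feats)
    correct = sum(1 for v in real_feats if sq_dist(v, c_real) < sq_dist(v, c_sim))
    correct += sum(1 for v in sim_feats if sq_dist(v, c_sim) < sq_dist(v, c_real))
    return correct / (len(real_feats) + len(sim_feats))
```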


| Failure mode | Detection metric | Acceptable threshold | Typical prevalence |
| --- | --- | --- | --- |
| Sensor dropout | % NaN/zero per channel | <2% per episode | 5–15% of raw data |
| Kinematic aliasing | Action bimodality in obs clusters | No bimodal clusters >1% of data | Domain-dependent |
| Label drift | Inter-annotator Cohen's κ | ≥0.8 | 12–18% disagreement |
| Distribution collapse | Category/pose entropy ratio | ≥0.7 vs. target distribution | Very common |
| Sim-real texture mismatch | Domain classifier accuracy | <75% (real vs. sim) | Universal in sim data |

All five failure modes are fixable, but only if you measure for them during collection. Teams that discover data quality failures at evaluation time rather than collection time lose 2–6 weeks per training iteration to re-collection and re-annotation.

Dataset quality criteria for vendor evaluation

Seven criteria determine whether a physical AI training dataset will produce reliable policies, ranked by impact on downstream policy performance. Use these whether you are evaluating an external data provider or auditing an in-house pipeline.

1. Action-observation synchronization accuracy must be <10ms for manipulation and <5ms for AV. Ask vendors for their synchronization method — hardware trigger versus software timestamp. Software-only synchronization introduces 10–50ms jitter that corrupts behavior cloning by misaligning the actions a policy learns with the observations that triggered them.

2. Demonstration quality filtering: The vendor should apply quality scoring per the Mandlekar et al. (arXiv:1907.02664) framework or equivalent multi-attribute approach. If they deliver raw, unfiltered episodes, budget 20–30% of your annotation spend on post-hoc quality filtering.

3. Diversity metrics: Request entropy statistics over scene configurations, object categories, lighting conditions, and operator identity. A dataset with 50K episodes from one operator in one room is less valuable than 10K episodes across 10 operators and 20 environments, because single-environment datasets produce policies that overfit to specific visual backgrounds and object arrangements.

4. Failure case inclusion rate: Datasets should include 15–30% failure demonstrations. Pure-success datasets produce brittle policies that cannot recover from perturbations. Toyota Research Institute has published on the value of including recovery behaviors in manipulation training sets.

5. Metadata completeness: Every episode should include robot URDF version, camera intrinsics/extrinsics, control mode (joint vs. Cartesian), success/failure label, and environment description. Incomplete metadata makes dataset combination and transfer learning across embodiments impossible. The Open X-Embodiment Collaboration (arXiv:2310.08864) learned this the hard way, requiring months of retroactive metadata standardization to merge heterogeneous datasets.
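A metadata gate is a set-difference check. The field names below are illustrative assumptions; map them to your own schema:

```python
REQUIRED_FIELDS = {
    "urdf_version", "camera_intrinsics", "camera_extrinsics",
    "control_mode", "success", "environment",
}  # illustrative field names, not a standard

def missing_metadata(episode_meta):
    """Return the required fields absent from one episode's metadata dict."""
    return sorted(REQUIRED_FIELDS - set(episode_meta))
```

Rejecting episodes with non-empty output at collection time is far cheaper than the retroactive standardization effort described above.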

6. Annotation audit trail: The vendor should provide inter-annotator agreement scores. If they cannot show Cohen's κ ≥ 0.8 per annotation type, assume label noise exceeds 15%, which directly degrades policy performance.

7. Licensing and reproduction rights: For any production deployment, verify that the data license covers model weights derived from the data, not just research use. Multiple robotics startups have been caught off-guard by restrictive research-only licenses when attempting commercial deployment.

No public vendor currently scores above 80% on all seven criteria for manipulation data. This gap is why organizations like Claru exist: to provide embodied AI datasets built to engineering-grade specifications from the start, rather than retrofitting research-grade collections for production use.

Key takeaways

  • Google DeepMind's RT-2 required approximately 130K real-world manipulation episodes for multi-task generalization across 700+ instructions, as reported in Brohan et al. (arXiv:2307.15818). That is the concrete volume benchmark for 6-DoF arm policies.

  • Mandlekar et al. (arXiv:1907.02664) showed that filtering teleoperation demonstrations to the top 25% by quality score improved imitation learning success rates by 30–50% on contact-rich tasks.

  • Pinto and Gupta (arXiv:2005.07866) found that real-world grasp datasets under-represent thin, flat, and deformable objects by 3–5× relative to household frequency distributions.

  • Chen et al. (arXiv:2510.21391) identified visual appearance mismatch (not dynamics mismatch) as the dominant sim-to-real transfer failure mode, accounting for 40–60% of performance degradation in manipulation.

  • Fewer than 5% of datasets in the Open X-Embodiment collection include force/torque annotations, according to the Open X-Embodiment Collaboration (arXiv:2310.08864).

  • Action-observation synchronization below 10ms matters more than camera resolution or frame rate for manipulation data quality.

  • Physical AI training data degrades through five distinct, measurable failure modes. Teams that monitor for all five during collection (not after) save 2–6 weeks per training iteration.

FAQ

How much training data does a robot manipulation policy need?

Google DeepMind's RT-2 used approximately 130K real-world episodes to generalize across 700+ language-conditioned manipulation instructions, as reported in Brohan et al. (arXiv:2307.15818). For single-task behavior cloning on a 6-DoF arm (e.g., picking a known object from a bin), 1K–5K high-quality teleoperated demonstrations typically achieve >80% success rates.

However, raw episode count is less important than episode quality. Mandlekar et al. (arXiv:1907.02664) demonstrated that quality-filtered datasets at 25% of original size can match or exceed the performance of unfiltered full-size datasets on contact-rich tasks. The practical lesson: invest in quality scoring infrastructure before scaling collection volume. 10K curated episodes often outperform 50K unfiltered ones.

What are the biggest failure modes in physical AI training data?

Physical AI training data degrades through five primary failure modes, each with a specific detection method. Sensor dropout (intermittent loss of camera or proprioceptive signals) typically affects 5–15% of raw collected episodes. Kinematic aliasing occurs when distinct physical states produce identical observations, especially without force/torque sensing. Label drift causes inter-annotator disagreement of 12–18% on task success labels, as measured by Mandlekar et al. (arXiv:1907.02664). Distribution collapse means systematic under-representation of object categories or poses by 3–5×, according to Pinto and Gupta (arXiv:2005.07866). Sim-real texture mismatch is responsible for 40–60% of sim-to-real transfer failure, according to Chen et al. (arXiv:2510.21391).

Each failure mode has a specific detection heuristic. Monitor all five during data collection, not at evaluation time.

How do I evaluate a physical AI training data vendor?

Seven criteria determine vendor quality for physical AI training data, ranked by impact on downstream policy performance. First, action-observation synchronization accuracy must be <10ms for manipulation — ask whether the vendor uses hardware triggers or software timestamps. Second, the vendor should apply demonstration quality filtering using a multi-attribute scoring framework similar to Mandlekar et al. (arXiv:1907.02664). Third, request diversity metrics showing entropy over scenes, objects, lighting, and operators. Fourth, datasets should include 15–30% failure demonstrations for training recovery behaviors. Fifth, verify metadata completeness including robot URDF version, camera calibration, and control mode. Sixth, inter-annotator agreement scores should show Cohen's κ ≥ 0.8. Seventh, confirm licensing terms explicitly cover model weights derived from the data.

No public vendor currently scores above 80% on all seven criteria for manipulation data. Request sample data and run your own quality audit before committing.

What sensor data is most important for robot learning?

RGB video is the universal modality consumed by every major VLA architecture, but force/torque sensing is the most impactful underserved modality. Fewer than 5% of datasets in the Open X-Embodiment collection include force/torque signals, according to the Open X-Embodiment Collaboration (arXiv:2310.08864), yet contact-rich tasks like insertion, cable routing, and tool use cannot succeed without this data.

For camera data, synchronization accuracy matters more than resolution: a perfectly synchronized 640×480 dataset is more valuable for behavior cloning than a poorly synchronized 4K dataset. Proprioceptive data (joint angles and velocities) improved cross-embodiment generalization by 15–20% in Open X-Embodiment experiments. Tactile sensing (DIGIT, GelSight) has the highest potential upside but the least infrastructure support — fewer than 2% of public manipulation datasets include any tactile data.

Is simulation data sufficient for training physical AI systems?

Simulation data alone is insufficient for manipulation and autonomous driving, though it works well for locomotion. Chen et al. (arXiv:2510.21391) found that visual appearance mismatch (not dynamics mismatch) accounts for 40–60% of sim-to-real transfer failures in manipulation tasks.

For locomotion, simulation-only training with sim-to-real transfer is the standard approach — quadrupeds and humanoids are trained in NVIDIA Isaac Gym and MuJoCo, then transferred using domain randomization. For AV, simulation handles rare-event augmentation, but real driving data remains the primary training signal (Waymo logs approximately 25 million miles per year of real data).

The most effective current approach is hybrid: pretrain on simulation or egocentric video data, then fine-tune on domain-specific real-world demonstrations. Pure-simulation training without real data fine-tuning has not achieved production-grade reliability for manipulation tasks as of 2026.

Related resources




  • Glossary: Physical AI Terms — Definitions of terms including sim-to-real transfer, behavior cloning, domain randomization, and training data


  • Sim-to-Real Gap — Technical deep dive into transfer failure modes and mitigation strategies
