The bottleneck in robotics has moved from model architecture to training data. Architecture search is commoditized: diffusion policies, VLAs, and transformer-based visuomotor policies all work when you feed them enough good trajectories. The constraint is data.
The Open X-Embodiment collaboration (a consortium including Google DeepMind, Stanford, UC Berkeley, and 20+ other institutions) aggregated over 1 million episodes from 22 robot morphologies. Their main finding, published in the 2024 Open X-Embodiment paper, was that data diversity—not model size—drove cross-embodiment transfer. Separately, the BridgeData V2 dataset from UC Berkeley reached 60,562 demonstrations across 13 skills and 24 environments. Even at that scale, the BridgeData V2 authors reported scene and object diversity as the binding limitation on generalization.
If you are building or buying a training data program for robotics, the question is not "how many demos" but "what does each demo contain, and how do you know it is correct." Policy performance, sim-to-real transfer, safety validation: all of it depends on decisions made in the first five stages of the pipeline described below.
Stage 1: Hardware setup and sensor calibration
Every robotics training data pipeline is defined by what the sensors can see and what the actuators can record. The minimum spec for a modern manipulation dataset:
| Component | Production Spec | Why It Matters |
|---|---|---|
| RGB cameras | ≥2 views (wrist + third-person), ≥30 fps, synchronized to ≤5 ms | Single-view policies lose 15–25% success rate vs. multi-view according to the DROID benchmark evaluations |
| Depth / pointcloud | Structured-light or ToF, registered to RGB, ≤2 mm noise at 0.5 m | Ke et al. in the GenDP paper (arXiv:2412.13877) require metric 3D input for their 3D diffusion policy |
| Joint encoders | ≥200 Hz proprioceptive logging, absolute position | Action-chunking policies (ACT, Diffusion Policy) need 50–100 Hz action labels; the raw encoder rate must be higher to avoid aliasing |
| Force/torque | 6-axis F/T at wrist, ≥100 Hz | Contact-rich tasks (insertion, wiping) need F/T to label success; without it, you rely on visual heuristics that miss subtle failures |
| Extrinsic calibration | Reprojection error ≤1.0 px across all cameras, re-verified every 500 episodes | Drift in camera-to-base transforms corrupts pointcloud registration silently |
Calibration drift is the failure mode nobody budgets for in large-scale robotics data collection. The DROID dataset maintainers at Stanford and Toyota Research Institute found that over multi-week campaigns, small bumps to camera mounts introduced 3–5 mm registration errors that degraded policy learning. The fix is unglamorous: bolt mounts to an optical breadboard, re-run hand-eye calibration on a fixed schedule, and log the reprojection error as metadata per episode.
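The per-shift verification gate described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: a known set of fiducial 3D points in the robot base frame, their detected 2D pixel locations, and a standard pinhole camera model. The function names (`reprojection_rms`, `calibration_ok`) and the 1.0 px threshold from the spec table are ours, not from any published pipeline.

```python
import numpy as np

def reprojection_rms(K, R, t, points_3d, points_2d):
    """RMS reprojection error in pixels for a set of fiducial points.

    K: 3x3 camera intrinsics; R, t: camera-from-base rotation/translation;
    points_3d: Nx3 fiducial coordinates in the robot base frame;
    points_2d: Nx2 detected pixel coordinates of the same fiducials.
    """
    cam = (R @ points_3d.T).T + t        # base frame -> camera frame
    proj = (K @ cam.T).T                 # pinhole projection
    proj = proj[:, :2] / proj[:, 2:3]    # perspective divide
    return float(np.sqrt(np.mean(np.sum((proj - points_2d) ** 2, axis=1))))

# Gate from the spec table: block collection when calibration has drifted
# past 1.0 px, and log the measured value as per-episode metadata.
THRESHOLD_PX = 1.0

def calibration_ok(K, R, t, points_3d, points_2d):
    err = reprojection_rms(K, R, t, points_3d, points_2d)
    return err <= THRESHOLD_PX, err
```

Running this check automatically before each shift, and storing `err` with every episode, is what makes a 3–5 mm mount bump show up the same day rather than weeks later in a degraded policy.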
Claru operates dedicated data collection cells with fixed optical-bench camera mounts and automated calibration verification before each shift. This infrastructure detail determines whether 10,000 episodes are usable or need re-collection.
Stage 2: Human operator protocols
Operator skill variance is the single largest noise source in behavioral cloning datasets, according to research from Mandlekar et al. at Stanford. Scaling demonstrations requires human operators, and controlling that variance is what separates production-grade data from research artifacts.
Mandlekar et al. in the RoboTurk study (arXiv:1909.12200) quantified this directly: across 2,200+ crowdsourced demonstrations, operator skill level explained more variance in downstream policy performance than dataset size did. Their "better operator" subset (top quartile by task completion speed and consistency) yielded policies with 2× the success rate compared to policies trained on the full unfiltered pool. The same RoboTurk study also showed that network latency above 150 ms degraded demonstration quality to the point where the resulting trajectories hurt policy training rather than helped.
The practical operator protocol checklist:
- Qualification task. Every operator completes a standardized evaluation (e.g., pick-and-place 20 objects across 3 workspace locations) before contributing to the production dataset. Record success rate and mean completion time.
- Teleoperation interface. Bilateral teleoperation (leader-follower) outperforms phone/VR interfaces for dexterous tasks. For the DexMV pipeline (arXiv:2104.08521), the UC San Diego team found that markerless hand tracking combined with retargeting to the Allegro Hand required explicit per-morphology calibration to avoid joint-limit violations.
- Session limits. Operator fatigue degrades demonstration quality after 60–90 minutes of continuous teleoperation. Mandate breaks. Log operator ID and session number per episode so you can detect quality degradation post hoc.
- Language annotation at collection time. If you are training language-conditioned policies like VLAs, the operator or a paired annotator should dictate the task instruction during or immediately after the episode. Retrospective captioning is cheaper but measurably noisier.
- Intervention protocol. Define when an operator should abort vs. retry. Failed grasps that are "recovered" mid-episode create bimodal action distributions that confuse behavioral cloning.
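The session-limit bullet above only pays off if the logged operator ID and session number are actually analyzed. A minimal sketch of post-hoc degradation detection, assuming episode records of the form `(operator_id, session_number, success, duration_seconds)`: it treats each operator's first session as a baseline and flags later sessions whose mean completion time blows past it. The 1.5× slowdown factor is an illustrative choice, not a published threshold.

```python
from collections import defaultdict
from statistics import mean

def flag_degraded_sessions(episodes, slowdown=1.5):
    """Flag (operator, session) pairs whose mean episode duration exceeds
    the operator's first-session baseline by the `slowdown` factor.

    A crude fatigue/quality signal; it only works because operator ID and
    session number were logged per episode at collection time.
    """
    by_session = defaultdict(list)
    for op, sess, success, dur in episodes:
        by_session[(op, sess)].append(dur)

    # First session (lowest session number) per operator sets the baseline.
    baselines = {}
    for op, sess in sorted(by_session):
        baselines.setdefault(op, mean(by_session[(op, sess)]))

    return sorted(
        (op, sess)
        for (op, sess), durs in by_session.items()
        if mean(durs) > slowdown * baselines[op]
    )
```

In production you would also fold in success rate and kinematic smoothness, but even this duration-only version catches the 60–90 minute fatigue cliff when sessions are logged honestly.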
No public lab has published a complete, reusable operator qualification spec. If you are evaluating a data vendor, ask for their operator qualification pass rate and whether they track per-operator downstream policy metrics. Most cannot answer.
Stage 3: Annotation schema design
Annotation schema design determines whether raw trajectories become usable training data for robotics policies. Trajectories alone are necessary but not sufficient: modern policies require structured annotations layered on top:
- Action representation. End-effector pose deltas (6-DOF + gripper) vs. joint-space targets. Ke et al. in the GenDP paper (arXiv:2412.13877) use 3D action anchors defined relative to the pointcloud, achieving 76% zero-shot success on unseen objects because actions are grounded in scene geometry rather than camera pixels.
- Task segmentation. Long-horizon demonstrations need phase labels (approach, grasp, transport, place, release). Without these, diffusion policies learn blurred multimodal distributions at phase boundaries.
- Object instance labels. Bounding boxes or segmentation masks per object per frame. This is expensive. Auto-labeling tools such as Grounded-SAM reduce manual annotation effort by approximately 70%, but you still need human QA on at least 10% of frames to bound error.
- Success / failure label. Binary, per-episode. This sounds trivial, but it requires a precise definition document per task family. "The cup is placed upright within 2 cm of the target marker" is a usable spec. "The task was completed" is not.
- Grasp taxonomy. For dexterous manipulation, label the grasp type (power, precision, lateral, tripod) per contact event. The DexMV pipeline from UC San Diego (arXiv:2104.08521) implicitly conditions on grasp type through its hand-pose retargeting, but explicit labels enable stratified evaluation and data balancing.
Treat your annotation schema as an API contract between data producers and model consumers. Version it the same way you version code.
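One way to make the "API contract" concrete is to encode the schema as a versioned record type with an explicit required-field set. The field names, the semver string, and the validator below are a hypothetical sketch of that idea, not a published schema; they mirror the bullets above (action representation, phase labels, success spec, language instruction, grasp taxonomy).

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

SCHEMA_VERSION = "2.1.0"  # bump on any field change, exactly like a code release

@dataclass
class EpisodeAnnotation:
    episode_id: str
    schema_version: str
    action_space: str                    # e.g. "ee_delta_6dof" or "joint_targets"
    phase_labels: List[str]              # approach / grasp / transport / place / release
    success: bool
    success_criterion: str               # the written spec the label was judged against
    language_instruction: Optional[str] = None  # required only for VLA training
    grasp_type: Optional[str] = None     # backfilled as None for pre-2.x episodes

REQUIRED = {"episode_id", "schema_version", "action_space",
            "phase_labels", "success", "success_criterion"}

def validate(record: dict) -> bool:
    """Quarantine (return False) any record missing a required field or
    carrying a stale schema version."""
    return REQUIRED <= record.keys() and record["schema_version"] == SCHEMA_VERSION
```

The design choice that matters is that optional fields are explicitly nullable rather than absent: an old episode missing `grasp_type` still validates, while one missing `success_criterion` is quarantined, never silently included.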
Stage 4: Quality filtering thresholds
Not all collected episodes should enter the training set. Based on published results from the RoboTurk and DROID projects, a well-run robotics data collection campaign should expect a 20–40% episode rejection rate after quality filtering. The filtering pipeline:
- Success filter. Remove all episodes labeled as failures unless you are explicitly training a classifier or reward model. For behavioral cloning, including failures degrades performance—Mandlekar et al. (arXiv:1909.12200) demonstrated this on the RoboTurk lifting task.
- Kinematics filter. Reject episodes where joint velocities exceed 95th-percentile thresholds (indicating jerky or panicked operator behavior) or where the end-effector leaves the calibrated workspace.
- Temporal consistency. Reject episodes with >2% dropped frames or timestamp jitter >10 ms. This matters for action-chunking architectures that assume uniform dt.
- Annotation completeness. Any episode missing a required annotation field is quarantined, not silently included.
- Outlier detection. Compute trajectory embeddings (e.g., via a pre-trained visual encoder), flag episodes >3σ from the cluster center for manual review.
| Filter | Typical Rejection Rate | Impact if Skipped |
|---|---|---|
| Success filter | 10–30% | Policy learns to imitate failure modes |
| Kinematics filter | 3–8% | Action distribution has heavy tails that destabilize diffusion sampling |
| Temporal consistency | 1–5% | Action prediction at chunk boundaries becomes noisy |
| Annotation completeness | 2–10% | Missing labels cause silent training bugs or NaN losses |
| Outlier detection | 1–3% | Rare but catastrophic: corrupted demos that dominate loss on small datasets |
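The kinematics and temporal-consistency filters above are mechanical enough to sketch directly. This is an illustrative implementation, assuming uniformly scheduled frames, a joint-position array of shape `T x J`, and a pre-computed fleet-wide velocity limit; the thresholds (2% dropped frames, 10 ms jitter) come from the bullets above, while the function names are ours.

```python
import numpy as np

def kinematics_filter(joint_pos, dt, vel_limit):
    """Accept only if every finite-difference joint velocity stays within
    the (pre-computed, e.g. fleet 95th-percentile) limit.
    joint_pos: T x J array sampled at a uniform interval dt."""
    vel = np.abs(np.diff(joint_pos, axis=0)) / dt
    return bool(np.all(vel <= vel_limit))

def temporal_filter(timestamps, expected_dt, max_drop_frac=0.02, max_jitter=0.010):
    """Accept only if <=2% of frames were dropped and timestamp jitter
    stays within 10 ms, so action-chunking models can assume uniform dt."""
    dts = np.diff(timestamps)
    # A gap of k*expected_dt implies k-1 missing frames.
    dropped = np.sum(np.round(dts / expected_dt) - 1)
    # Jitter: deviation of each interval from the nearest whole multiple.
    jitter = np.max(np.abs(dts - np.round(dts / expected_dt) * expected_dt))
    n_expected = len(timestamps) + dropped
    return bool(dropped / n_expected <= max_drop_frac and jitter <= max_jitter)
```

Episodes failing either check are rejected before training, and the pass/fail flags are logged as per-episode metadata so rejection rates can be audited per operator and per collection cell.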
If your vendor reports less than 5% rejection, they are either not filtering rigorously or their collection quality is genuinely exceptional. Ask which one and ask for evidence.
Stage 5: Dataset versioning and governance
Robotics datasets are living artifacts that require formal versioning and governance infrastructure. Unlike static benchmarks such as ImageNet, robotics training datasets are continually extended with new tasks, environments, and morphologies. Production-grade dataset management requires:
- Immutable episode IDs. Every episode gets a UUID at collection time. Never reuse or reassign.
- Schema versioning. When the annotation schema changes (e.g., adding a new label field), increment the dataset schema version. Old episodes get the new field backfilled or explicitly marked null. Never silently drop them.
- Reproducible splits. Train/val/test splits defined by a deterministic hash of episode IDs, stored in a manifest file. Splits must be stratified by environment, operator, and task to prevent data leakage.
- Provenance tracking. For every episode: collection cell ID, operator ID, hardware config version, calibration timestamp, annotation pipeline version, and all filter pass/fail flags. This is the minimum metadata needed to diagnose a policy regression.
- Consent and licensing. If operators are filmed (egocentric setups), IRB/ethics approval and data-use agreements must be on file before collection begins. The egocentric video community learned this the hard way.
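The deterministic-split bullet can be sketched in a dozen lines. One common pattern, shown here as an assumption rather than a description of any specific lab's tooling, is to hash a leakage key (environment plus operator, for example) instead of the raw episode UUID, so that every episode from the same scene lands in the same split and adding new episodes never moves old ones.

```python
import hashlib

def assign_split(leakage_key: str, val_frac: float = 0.05,
                 test_frac: float = 0.05) -> str:
    """Deterministic split assignment from a hash of the leakage key.

    Hashing e.g. f"{environment_id}/{operator_id}" rather than the episode
    UUID keeps all episodes from one scene/operator in one split, which is
    what actually prevents train/test leakage. The mapping is stable across
    re-runs and across dataset growth, and is stored in the split manifest.
    """
    h = int.from_bytes(hashlib.sha256(leakage_key.encode()).digest()[:8], "big")
    u = h / 2**64  # uniform in [0, 1), fully determined by the key
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

Because the assignment is a pure function of the key, the manifest file can be regenerated and verified at any time, which is the property that makes a policy regression diagnosable months later.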
Toyota Research Institute and Google DeepMind both maintain internal dataset registries with these properties. No open-source tool fully solves this for multi-site robotics data; most teams build custom tooling on top of DVC, Weights & Biases Artifacts, or cloud object stores with metadata sidecars.
Dataset versioning is unglamorous and chronically under-invested. It is also what separates a research artifact from a production training asset.
Evaluation checklist: Comparing training data vendors for robotics
If you are evaluating an external data provider for physical AI training data, here is a vendor-agnostic checklist distilled from the pipeline above:
| Capability | Question to Ask | Red Flag |
|---|---|---|
| Sensor suite | What cameras, depth sensors, and proprioceptive streams do you record? At what rate? | No depth sensor; frame rate below 30 fps; no force/torque |
| Calibration | How often do you re-calibrate? What is your reprojection error threshold? | "We calibrate at setup" (once, without re-verification) |
| Operator qualification | What is your operator pass rate? Do you track per-operator metrics? | No qualification process or no per-operator tracking |
| Annotation schema | Can you share your schema spec document? Is it versioned? | No written spec; annotations added ad hoc |
| Quality filtering | What is your typical episode rejection rate? What filters do you apply? | Less than 5% rejection with no explanation |
| Dataset versioning | How do you handle schema changes? Are splits deterministic? | "We just add new episodes to the folder" |
| Scalability | How many episodes per week can you sustain across how many environments? | Cannot cite a production throughput number |
Claru operates the full pipeline from sensor-calibrated collection cells through operator management, structured annotation (including data enrichment for 3D action labels and language captions), automated quality filtering, and versioned dataset delivery. We built this for teams training manipulation policies at the scale that VLAs and diffusion policies require. If your ML team spends more than 40% of its time on data infrastructure rather than modeling, that ratio is wrong.
Key takeaways
- Mandlekar et al. at Stanford (arXiv:1909.12200) showed that top-quartile operators produce demonstrations yielding 2× the policy success rate compared to unfiltered crowdsourced pools. Operator qualification is the single highest-leverage intervention in a robotics training data pipeline.
- Ke et al. in the GenDP paper (arXiv:2412.13877) achieved 76% zero-shot success on unseen objects by grounding actions in 3D pointclouds, which requires metric depth sensors with ≤2 mm noise registered to RGB.
- The DexMV pipeline from UC San Diego (arXiv:2104.08521) showed that retargeting human hand demonstrations to dexterous robots like the Allegro Hand requires per-morphology calibration to avoid joint-limit violations and action distribution artifacts.
- A well-run collection campaign should expect 20–40% episode rejection after quality filtering. Rejection rates below 5% indicate insufficient filtering, based on published rates from the RoboTurk and DROID projects.
- Dataset versioning (immutable episode IDs, schema version tracking, deterministic stratified splits) is what separates a research artifact from a production training data asset.
- Camera calibration drift of 3–5 mm over multi-week campaigns corrupts pointcloud registration silently, according to the DROID dataset maintainers at Stanford and Toyota Research Institute. Automated re-verification catches this; manual spot checks do not.
- If your ML team spends more than 40% of its time on data infrastructure instead of model development, externalize the data pipeline.
FAQ
What training data do robots need?
Robots trained via behavioral cloning or reinforcement learning from demonstrations need multi-modal trajectory data: synchronized RGB video (typically ≥2 views), depth or pointcloud data, proprioceptive state (joint angles and velocities at ≥200 Hz), force/torque readings for contact-rich tasks, and structured annotations including task-phase segmentation, success labels, and language instructions.
The specifics depend on the policy architecture. Diffusion policies need dense action chunks at 50–100 Hz. VLAs additionally require paired language captions per episode. Ke et al. in the GenDP paper (arXiv:2412.13877) showed that grounding actions in 3D pointcloud coordinates rather than pixel space enables zero-shot transfer to unseen objects, which makes metric depth data increasingly non-optional for generalizable manipulation.
How much training data is needed for robot learning?
The amount of training data for robotics depends on task complexity and desired generalization. Single-task behavioral cloning on a fixed object set can work with 100–200 demonstrations (early Diffusion Policy results from Chi et al. at Columbia showed this), but multi-task, multi-object policies need 10,000–100,000+ episodes. The Open X-Embodiment collaboration used over 1 million episodes for cross-embodiment transfer. UC Berkeley's BridgeData V2 collected 60,562 demonstrations and still reported scene diversity—not dataset size—as the binding constraint on generalization.
Quality-adjusted effective dataset size matters more than raw episode count. Mandlekar et al. at Stanford (arXiv:1909.12200) showed that 500 expert demonstrations outperformed 2,000 mixed-quality demonstrations on the same task. For a deeper analysis of scaling requirements, see Claru's coverage of VLA training data volume.
How to evaluate a robotics training data vendor?
Evaluate a robotics training data vendor by asking five concrete questions:
- What is the sensor suite, frame rate, and synchronization tolerance? Anything below 30 fps or without depth is insufficient for modern diffusion or VLA policies.
- How are operators qualified, and what is the qualification pass rate? Mandlekar et al. in the RoboTurk study (arXiv:1909.12200) showed that operator skill is the dominant predictor of downstream policy quality.
- Is the annotation schema versioned and documented? If the vendor cannot share a written spec, annotation consistency will be low.
- What is the typical episode rejection rate after quality filtering? Based on published data from RoboTurk and DROID, expect 20–40% for a rigorous pipeline.
- How are dataset versions, splits, and provenance tracked? If the answer involves manual folder management, you will face reproducibility problems at scale.
A vendor who can answer all five with specific numbers is operating at production grade.
What is the sim-to-real gap in robotics training data?
The sim-to-real gap is the performance drop that occurs when a policy trained in simulation is deployed on a physical robot. The three most common causes are visual domain shift (simulated textures and lighting differ from reality), dynamics mismatch (simulated contacts and friction do not match real physics), and sensor noise that is not modeled in simulation.
The DexMV pipeline from UC San Diego (arXiv:2104.08521) addressed part of this gap by collecting real human hand demonstrations and retargeting them to the robot's kinematic model, bypassing the visual sim-to-real gap entirely. The current consensus at frontier labs including Google DeepMind and Toyota Research Institute is that some real-world data is irreplaceable—simulation can augment and pre-train, but final policy fine-tuning on real embodied AI datasets remains necessary for reliable deployment.
What annotation formats are used for robotics datasets?
The most widely adopted annotation format for robotics training data is RLDS (Reinforcement Learning Datasets), which Google DeepMind developed and used for the Open X-Embodiment project. RLDS wraps data in TFRecord with a standardized schema. The typical per-episode container (whether HDF5, Zarr, or RLDS) contains:
- RGB images as compressed video streams (H.264/H.265)
- Depth maps as 16-bit PNG sequences
- Proprioceptive state as float32 arrays at the logging rate
- Actions as end-effector pose deltas or joint-space targets at the policy control frequency (typically 10–50 Hz)
- Per-frame timestamps synchronized to a common clock
- Metadata fields for task ID, language instruction, success label, operator ID, and collection cell configuration
There is no single universal standard yet, but any format that separates raw sensor streams from derived annotations and includes schema versioning will be maintainable at scale.
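To make the "separate raw streams from derived annotations" principle concrete, here is a hypothetical in-memory layout mirroring such a per-episode container. The group names, array shapes, and schema-version string are illustrative assumptions, not the RLDS spec or any vendor's format.

```python
import numpy as np

def make_episode(episode_id: str, n_frames: int, control_hz: int = 30) -> dict:
    """Skeleton episode container. Raw sensor streams and derived
    annotations live in separate groups, so annotations can be
    regenerated under a new schema version without touching sensor data."""
    return {
        "metadata": {
            "episode_id": episode_id,        # immutable UUID in practice
            "schema_version": "2.1.0",       # assumption: semver-style versioning
            "task_id": None,
            "language_instruction": None,
            "success": None,
            "operator_id": None,
            "cell_config": None,
        },
        "raw": {  # written once at collection time, never rewritten
            "rgb_wrist": np.zeros((n_frames, 480, 640, 3), np.uint8),
            "depth_third_person": np.zeros((n_frames, 480, 640), np.uint16),
            "proprio": np.zeros((n_frames, 14), np.float32),
            "timestamps": np.arange(n_frames) / control_hz,
        },
        "annotations": {  # derived labels, rewritable per schema version
            "actions_ee_delta": np.zeros((n_frames, 7), np.float32),
            "phase_labels": ["unlabeled"] * n_frames,
        },
    }
```

The same three-group separation maps directly onto HDF5 or Zarr groups, with the metadata dict stored as attributes.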
Related resources
- Training Data for Robotics — Claru's overview of data requirements for manipulation and locomotion policies.
- VLM vs. VLA: What's the Difference? — Understanding the model architectures that consume this data.
- Glossary: Key Terms in Physical AI — Definitions for behavioral cloning, diffusion policy, action chunking, and other terms used in this post.