TL;DR
- Chi et al. (Columbia, arXiv:2305.12171) showed Diffusion Policy achieves 85.7% average success on Push-T with ~200 demonstrations. Success degrades sharply below ~100 demos on multi-modal tasks.
- Ze et al. (arXiv:2409.00588) showed 3D Diffusion Policy (DP3) with sparse point-cloud observations hits 83.6% average success across Adroit and MetaWorld tasks, beating 2D counterparts by 34.6 percentage points. 3D-structured data appears more sample-efficient than RGB alone.
- Ke et al. (arXiv:2410.14868) found that replacing DDPM with flow matching in 3D Diffusion Policy cuts inference to 28 ms per step while maintaining or improving task success. This reduces the action-chunking horizon needed and changes data collection requirements.
- Demonstration quality matters more than raw count. Low-variance trajectories collected at consistent control frequencies beat large noisy datasets: temporally inconsistent demonstrations reduce policy smoothness even at scale.
Why data matters more than architecture for diffusion policy robotics
Diffusion policy in robotics refers to a class of visuomotor policy architectures that use iterative denoising—borrowed from diffusion models in generative AI—to predict robot action sequences from sensory observations. Most writing about diffusion policy focuses on the denoising process: DDPM versus DDIM, U-Net versus Transformer backbone, number of diffusion steps. These matter, but they are not where most real-world deployments fail. Deployments fail on data.
Chi et al. at Columbia (arXiv:2305.12171) introduced Diffusion Policy to model multi-modal action distributions via iterative denoising over action sequences. Diffusion models handle multi-modality (e.g., going left or right around an obstacle) far better than MSE-regression policies, which average modes and produce frozen or erratic behavior. But this expressiveness cuts both ways: the model will faithfully reproduce whatever distribution your data encodes, including its gaps and biases.
This post is a data specification document, not an architecture tutorial. If you are building a diffusion policy training pipeline for a physical manipulation task, the decisions below will determine your success rate more than your choice of noise schedule.
Dataset size thresholds: how many demonstrations do you actually need?
Diffusion policy robotics systems require between 50 and tens of thousands of demonstrations depending on task complexity, observation representation, and generalization scope. The published literature provides useful anchor points across this range.
Single-task, single-object benchmarks. Chi et al. (arXiv:2305.12171) report 85.7% success on Push-T and 95.0% on the Can task in RoboMimic using roughly 200 human teleoperation demonstrations. Below approximately 100 demonstrations, performance on multi-modal tasks drops steeply because the diffusion model cannot cover the action distribution's modes. For unimodal tasks (e.g., a single reaching motion), as few as 50 demonstrations can work, but this is a degenerate case you rarely encounter in deployment.
3D observation representations are more sample-efficient. Ze et al. (arXiv:2409.00588) showed that 3D Diffusion Policy (DP3), conditioning on sparse point clouds rather than RGB images, achieves 83.6% average success across 72 tasks spanning the Adroit dexterous manipulation and MetaWorld benchmarks, beating image-based Diffusion Policy by 34.6 percentage points on the same task suite. The reason is straightforward: 3D representations encode geometric structure directly, so the policy needs fewer demonstrations to learn spatial relationships that RGB policies must infer from pixels. If you have depth sensors, investing in point-cloud preprocessing will reduce your training data collection burden.
Multi-task and generalization settings. When the goal is generalization across object categories, initial poses, or environments, dataset size requirements jump by 10–50×. The RoboCasa benchmark uses thousands of demonstrations across kitchen tasks with procedurally varied object placements. Teams training on the Open X-Embodiment dataset operate at the scale of hundreds of thousands of trajectories across embodiments. No published study has identified a point where diffusion policy saturates on multi-task data; in practice, performance continues to improve log-linearly with data up to the largest scales tested.
A practical rule of thumb: budget 200 demonstrations per task variant (where a "variant" means a distinct object category or workspace configuration), and plan to 3× that if you need >90% reliability.
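This rule of thumb is easy to encode as a budgeting helper. A minimal sketch; the function name and thresholds are this post's heuristic, not a published scaling law:

```python
def demo_budget(task_variants: int, target_reliability: float = 0.9) -> int:
    """Rule-of-thumb demonstration budget: 200 demos per task variant,
    tripled when targeting >90% deployment reliability. Heuristic only."""
    per_variant = 200
    if target_reliability > 0.9:
        per_variant *= 3
    return task_variants * per_variant
```

For a three-variant pick-and-place task at baseline reliability this budgets 600 demonstrations; pushing past 90% reliability triples that to 1,800.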
Demonstration quality criteria: what makes a good demo
Demonstration quality has a larger effect on diffusion policy robotics performance than raw demonstration count. The four specific quality dimensions that most affect training outcomes are temporal consistency, trajectory variance, observation completeness, and action-space representation.
Temporal consistency. Diffusion Policy predicts action chunks—sequences of future actions, typically 8–16 steps. If your demonstration data was collected at an inconsistent control frequency (e.g., a human teleoperator pausing mid-trajectory, or dropped frames in recording), the denoising network learns to reproduce those hesitations as part of the action distribution. Chi et al. (arXiv:2305.12171) note that action chunking provides "temporal action consistency" that suppresses idle oscillations, but this only works if the training data itself is temporally smooth.
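One way to enforce this before training is to audit each demonstration's timestamps for pauses and dropped frames. A NumPy sketch, with illustrative thresholds (a 500 ms pause limit and a 1% drop budget, matching the checklist later in this post):

```python
import numpy as np

def check_temporal_consistency(timestamps, target_hz=10.0,
                               max_pause_ms=500.0, max_drop_frac=0.01):
    """Flag demonstrations that violate temporal-consistency criteria.
    `timestamps` is a 1-D sequence of per-frame times in seconds.
    Thresholds are this post's suggestions, not values from the cited papers."""
    dt = np.diff(np.asarray(timestamps, dtype=float))
    nominal = 1.0 / target_hz
    longest_pause_ms = float(dt.max()) * 1000.0
    # A gap of ~k nominal periods implies k-1 dropped frames.
    dropped = int(np.round(dt / nominal).sum()) - len(dt)
    drop_frac = dropped / (len(dt) + dropped) if dropped > 0 else 0.0
    ok = longest_pause_ms <= max_pause_ms and drop_frac <= max_drop_frac
    return ok, {"longest_pause_ms": longest_pause_ms, "drop_frac": drop_frac}
```

Running this over every trajectory before training catches the teleoperator pauses and recording gaps that the denoising network would otherwise learn to imitate.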
Trajectory variance within the success set. You want diverse strategies across demonstrations but low variance and full commitment within each individual trajectory. A dataset where half the demonstrations approach from the left and half from the right is great: the diffusion model handles bi-modality well. A dataset where individual demonstrations waver between left and right approaches within a single trajectory is harmful, because it teaches the policy to oscillate.
Observation completeness. Ze et al. (arXiv:2409.00588) found that point-cloud quality—number of points, occlusion handling, noise level—directly affects downstream 3D Diffusion Policy performance. Ze et al. use sparse point clouds (typically 1,024–4,096 points) and recommend consistent point sampling density across demonstrations. If your depth sensor placement changes between data collection sessions, you introduce distribution shift that the policy cannot recover from.
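A common way to get a fixed-size, evenly covering cloud per frame (and thus consistent sampling density across demonstrations) is farthest point sampling. A minimal NumPy sketch, not DP3's actual preprocessing code:

```python
import numpy as np

def farthest_point_sample(points: np.ndarray, n_samples: int,
                          seed: int = 0) -> np.ndarray:
    """Downsample a point cloud (N, 3) to a fixed number of points with
    farthest point sampling, so every frame feeds the policy a cloud of
    the same size and roughly uniform spatial coverage."""
    rng = np.random.default_rng(seed)
    n = len(points)
    if n <= n_samples:
        return points
    chosen = np.empty(n_samples, dtype=int)
    chosen[0] = rng.integers(n)
    # Distance from every point to the nearest already-chosen point.
    dists = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(dists.argmax())  # farthest from the current set
        dists = np.minimum(dists,
                           np.linalg.norm(points - points[chosen[i]], axis=1))
    return points[chosen]
```

Farthest point sampling is O(N · n_samples), which is acceptable as an offline preprocessing step; random subsampling is faster but gives less even coverage across frames.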
Action space representation. Demonstrations recorded in joint space versus Cartesian end-effector space produce different diffusion policy behaviors. Most successful deployments (including Chi et al.'s original Diffusion Policy work in arXiv:2305.12171) use end-effector velocity or delta-position actions, which produce smoother diffusion targets than joint-space actions. If you record joint-space demonstrations, make sure your trajectories are kinematically consistent. Redundant-DOF robots can produce multiple joint configurations for the same end-effector pose, creating artificial multi-modality that wastes model capacity.
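Converting absolute end-effector recordings into delta-position actions is a small transform. A sketch covering position and gripper state only; orientation deltas, which a real pipeline also needs, are omitted, and the function name is illustrative:

```python
import numpy as np

def to_delta_actions(ee_positions: np.ndarray, gripper: np.ndarray) -> np.ndarray:
    """Convert absolute end-effector positions (T, 3) plus gripper state (T,)
    into delta-position actions (T-1, 4): per-step displacement and the
    next gripper command."""
    deltas = np.diff(ee_positions, axis=0)              # (T-1, 3) displacement
    return np.concatenate([deltas, gripper[1:, None]], axis=1)
```

Because deltas are small and centered near zero, they make a better-conditioned denoising target than absolute joint angles, which is part of why end-effector delta actions are the common choice.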
Collecting high-quality egocentric demonstration data at consistent frame rates is harder than it sounds, and it is the single largest bottleneck for most robotics teams.
Action chunking and its downstream data collection implications
Action chunking—predicting a sequence of T_a future actions and executing T_e ≤ T_a of them before re-planning—is central to diffusion policy robotics performance. Chi et al. (arXiv:2305.12171) found that a prediction horizon of T_p = 16 and an action execution horizon of T_a = 8 at 10 Hz control frequency worked well across their benchmarks. This has three direct implications for how you collect data:
- Minimum trajectory length. Each demonstration must be at least T_p steps long, but practically trajectories need to be much longer—100+ steps at 10 Hz equals 10 seconds minimum—for the policy to learn approach-grasp-lift sequences. Very short demonstrations (under 2 seconds) do not give the diffusion model enough temporal context.
- Control frequency must be fixed and known. The chunk size is defined in control steps, not wall-clock time. If you collect data at 30 Hz but train with T_p = 16, each chunk spans 0.53 seconds. At 10 Hz, each chunk spans 1.6 seconds. This changes what the policy treats as a single "action decision." Mixing control frequencies within a dataset without resampling will degrade performance.
- Faster inference changes chunk requirements. Ke et al. (arXiv:2410.14868) proposed 3D Flow Policy (3D-FP), replacing the DDPM denoising process with flow matching (building on Lipman et al.'s conditional flow matching framework). It runs inference in 28 ms per step, roughly 5× faster than the standard DDPM variant, so you can re-plan more frequently, which reduces the action execution horizon T_e needed. With faster re-planning, shorter action chunks become viable, and your demonstrations can be slightly less perfect because the policy corrects faster. But the data collection frequency requirement actually increases: you need higher-resolution temporal data to support shorter planning cycles.
The interaction between chunk size, control frequency, and inference speed means there is no universal "best" action chunking configuration. You must co-design it with your data collection protocol.
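The predict-T_p / execute-T_e pattern above can be sketched as a receding-horizon control loop. The `policy`, `get_obs`, and `send_action` callables are placeholders for your own model and robot interface:

```python
import numpy as np

def run_receding_horizon(policy, get_obs, send_action, n_steps,
                         t_pred=16, t_exec=8):
    """Receding-horizon execution: predict a chunk of t_pred actions,
    execute the first t_exec, then re-plan from a fresh observation.
    Defaults mirror the T_p=16, T_a=8 configuration discussed above."""
    executed = []
    step = 0
    while step < n_steps:
        chunk = policy(get_obs())            # (t_pred, action_dim)
        assert len(chunk) >= t_exec
        for action in chunk[:t_exec]:        # commit to t_exec steps
            send_action(action)
            executed.append(action)
            step += 1
            if step >= n_steps:
                break
    return np.asarray(executed)
```

Shrinking `t_exec` makes the controller more reactive at the cost of more inference calls per second, which is exactly the trade that faster flow-matching inference relaxes.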
Data diversity: object pose, lighting, and embodiment variance
Diversity in the training distribution is what separates a diffusion policy that works in one lab setup from one that survives deployment. The table below summarizes minimum diversity thresholds drawn from published benchmarks.
| Diversity axis | Minimum recommended variance | Impact on policy | Evidence source |
|---|---|---|---|
| Object initial pose | ±15 cm translation, ±45° rotation across demos | Prevents overfitting to single grasp approach | Chi et al. (arXiv:2305.12171) Push-T results |
| Lighting conditions | ≥3 distinct lighting setups (for RGB-based policies) | Reduces visual encoder fragility | Standard domain-randomization findings |
| Camera viewpoint | ≥2 cameras with ≥30° baseline separation | Improves depth estimation and spatial reasoning | Ze et al. (arXiv:2409.00588) multi-view results |
| Object instances | ≥5 instances per category for category-level generalization | Prevents texture/shape memorization | RoboCasa benchmark design |
| Background variation | ≥3 distinct table/workspace surfaces | Reduces background-dependent behavior | Empirical; no published threshold |
| Embodiment (cross-robot) | Not recommended unless using shared representation | Negative transfer without adaptation layers | Sun et al. (arXiv:2502.10040) |
3D observations partially bypass visual diversity requirements. Ze et al. (arXiv:2409.00588) show that DP3's point-cloud conditioning is more robust to lighting changes and camera viewpoint shifts than RGB-conditioned policies, because geometry is lighting-invariant. This does not eliminate the need for object pose diversity (you still need varied initial configurations) but it reduces how much lighting and background variation your dataset requires.
Embodiment transfer is still an open problem. Sun et al. (arXiv:2502.10040) survey diffusion-based policy learning across manipulation tasks and note that cross-embodiment transfer—training on data from one robot, deploying on another—requires either explicit action-space alignment or shared latent representations. Mixing demonstrations from a Franka Panda and a UR5 into one Diffusion Policy dataset without embodiment-specific action decoders will produce worse results than training on either robot alone. If cross-embodiment generalization is your goal, look at VLA architectures that handle embodiment conditioning at the foundation-model level.
The practical ordering: invest in pose and object-instance diversity first, lighting diversity second (or switch to 3D observations), and avoid cross-embodiment mixing without architectural support.
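The pose-diversity bounds in the table can be enforced at collection time by sampling each episode's initial object pose. A sketch using the ±15 cm / ±45° bounds; the workspace center and planar-pose assumption are illustrative:

```python
import numpy as np

def sample_initial_pose(rng, nominal_xy=(0.5, 0.0),
                        trans_cm=15.0, rot_deg=45.0):
    """Sample an object's initial planar pose within the diversity bounds
    from the table (+/-15 cm translation, +/-45 deg rotation).
    Returns (x, y, yaw_rad) in meters and radians."""
    dx, dy = rng.uniform(-trans_cm / 100.0, trans_cm / 100.0, size=2)
    yaw = rng.uniform(-np.deg2rad(rot_deg), np.deg2rad(rot_deg))
    return nominal_xy[0] + dx, nominal_xy[1] + dy, yaw
```

Logging the sampled pose alongside each demonstration also gives you the metadata needed to verify, post hoc, that your dataset actually covers the intended range rather than clustering near the nominal pose.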
A concrete data specification checklist
The following table provides a minimum data specification for a diffusion policy robotics deployment on a single manipulation task (e.g., pick-and-place of a known object category). Each parameter is tied to published evidence or reproducible benchmark results.
| Parameter | Specification | Rationale |
|---|---|---|
| Demonstrations per task | 200 minimum; 600+ for >90% target | Chi et al. (arXiv:2305.12171) observed log-linear scaling |
| Control frequency | Fixed at 10–30 Hz; choose before collection, do not mix | Chunk size is defined in timesteps, not wall-clock time |
| Min trajectory length | 100 steps (10 s at 10 Hz) | Must exceed T_p by ≥6× for sufficient temporal context |
| Object pose variation | ±15 cm, ±45° rotation per demonstration | Covers approach-angle diversity per Chi et al. Push-T results |
| Object instances | ≥5 per target category | Category-level generalization requires shape and texture variance |
| Camera count | ≥2 RGB or 1 depth sensor for point clouds | DP3 per Ze et al. (arXiv:2409.00588) requires 3D input |
| Lighting setups | ≥3 for RGB; ≥1 for point-cloud policies | 3D representations bypass lighting sensitivity |
| Action representation | End-effector delta position + gripper state | Smoother diffusion targets than joint space per Chi et al. |
| Trajectory filtering | Remove demos with >500 ms pauses or incomplete task success | Temporal consistency for action chunking |
| Frame drops | <1% dropped frames; interpolate if needed | Chunk integrity for denoising training |
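The per-trajectory rows of this checklist translate directly into an accept/reject filter for your dataset. A sketch with the thresholds from the table; the type and function names are this post's, not from any cited codebase:

```python
from dataclasses import dataclass

@dataclass
class DemoSpec:
    """Machine-readable version of the checklist above."""
    min_demos: int = 200
    control_hz: float = 10.0
    min_traj_steps: int = 100
    max_pause_ms: float = 500.0
    max_frame_drop: float = 0.01
    min_object_instances: int = 5

def accept_demo(spec: DemoSpec, n_steps: int, longest_pause_ms: float,
                frame_drop_frac: float, succeeded: bool) -> bool:
    """Apply the per-trajectory filters from the checklist to one demo."""
    return (succeeded
            and n_steps >= spec.min_traj_steps
            and longest_pause_ms <= spec.max_pause_ms
            and frame_drop_frac <= spec.max_frame_drop)
```

Keeping the spec as a single dataclass means the same object can drive both the collection UI (reject bad demos live) and the offline training pipeline (filter before dataloading), so the two never drift apart.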
For teams that need large-scale, high-quality physical AI training data (egocentric manipulation demonstrations with controlled diversity), Claru collects first-person video and sensor data with precise metadata (object identity, pose, lighting condition, camera intrinsics) at fixed frame rates. This maps directly to the temporal consistency and diversity requirements above.
Google DeepMind's RT-2 and follow-on robotics work relies on internal data pipelines that enforce similar quality constraints on action-space consistency and trajectory filtering. No major lab has published exact data-spec thresholds, which is why this checklist synthesizes from benchmark results rather than production disclosures.
Key takeaways
- Chi et al. (arXiv:2305.12171) showed Diffusion Policy reaches 85.7% success on Push-T with ~200 demonstrations. Multi-modal tasks need at least 100 demos to cover the action distribution.
- Ze et al. (arXiv:2409.00588) showed 3D Diffusion Policy beats image-based Diffusion Policy by 34.6 percentage points on average across 72 tasks. Point-cloud data collection is a direct lever for sample efficiency.
- Ke et al. (arXiv:2410.14868) showed 3D Flow Policy achieves 28 ms inference, enabling shorter action execution horizons and more frequent re-planning, which changes optimal data collection frequency.
- Temporal consistency in demonstrations (fixed control frequency, no pauses, no dropped frames) matters more than raw count for action-chunking policies.
- Object pose and instance diversity give you the most return. Lighting diversity can be partially bypassed by switching from RGB to point-cloud observations.
- Cross-embodiment data mixing without explicit action-space alignment degrades Diffusion Policy performance according to Sun et al. (arXiv:2502.10040).
- Budget 200 demonstrations per task variant as a minimum. Plan for 3× that if targeting >90% reliability in deployment.
FAQ
How many demonstrations does Diffusion Policy need for manipulation?
Diffusion Policy needs approximately 200 human teleoperation demonstrations to achieve strong single-task manipulation performance, according to Chi et al. (arXiv:2305.12171), who reported 85.7% success on Push-T and 95.0% on RoboMimic Can at that scale. That figure applies to a single object in a fixed environment. For category-level generalization (picking up any mug, not just one specific mug), budget 200 demonstrations per distinct task variant (unique object category × workspace configuration). Tasks with multi-modal action distributions (multiple valid strategies) require more data to cover each mode; below approximately 100 demonstrations, Chi et al. observed sharp performance degradation. Teams targeting production reliability above 90% should plan for 600+ demonstrations per task variant.
Does 3D Diffusion Policy need less training data than image-based Diffusion Policy?
3D Diffusion Policy (DP3) is more sample-efficient than image-based Diffusion Policy in practice. Ze et al. (arXiv:2409.00588) showed DP3 (conditioning on sparse point-cloud observations) outperforms image-based Diffusion Policy by 34.6 percentage points on average across 72 tasks in Adroit and MetaWorld benchmarks. The gain comes from geometric structure in 3D inputs: the policy does not need to learn depth, spatial relationships, or lighting invariance from raw pixels. 3D policies can often match RGB policy performance with fewer demonstrations, and they are more robust to lighting and background changes, reducing diversity requirements along those axes.
What action chunk size should I use for Diffusion Policy?
Chi et al. (arXiv:2305.12171) found a prediction horizon of T_p = 16 steps and an action execution horizon of T_a = 8 steps at 10 Hz control frequency worked well across their benchmarks. The right chunk size depends on your control frequency and inference speed. Ke et al. (arXiv:2410.14868) showed flow matching reduces inference to 28 ms per step, enabling more frequent re-planning and potentially shorter execution horizons. The hard constraint: your demonstration trajectories must be collected at a fixed, known control frequency, and each trajectory needs to be at least 6× T_p long so the model sees enough temporal context during training.
How does demonstration quality affect Diffusion Policy performance?
Demonstration quality is arguably more important than demonstration quantity for diffusion policy robotics. Because the architecture predicts action chunks (sequences of 8–16 future actions per Chi et al. in arXiv:2305.12171), any temporal inconsistency in the training data—pauses, dropped frames, wavering between strategies within a single trajectory—gets encoded as part of the learned action distribution. At inference time, this shows up as jittery or hesitant behavior. Each demonstration should be temporally smooth, collected at a consistent control frequency (10–30 Hz with <1% dropped frames), and represent a committed strategy for completing the task. Filter out demonstrations where the operator paused for more than 500 ms or failed partway through. Diverse strategies across demonstrations (some go left, some go right) are good; indecision within a single demonstration is not.
Can I mix data from different robots to train Diffusion Policy?
Mixing data from different robots to train a single Diffusion Policy is not recommended without explicit architectural support. Sun et al. (arXiv:2502.10040) note that cross-embodiment transfer requires explicit action-space alignment or shared latent representations. Mixing demonstrations from a Franka Panda and a UR5 into a single Diffusion Policy training dataset without embodiment-specific action decoders will typically produce worse results than training on either robot's data alone. The action spaces, joint limits, and kinematic structures differ enough that the diffusion model cannot produce coherent action chunks. If you need cross-embodiment generalization, consider VLA-style architectures that condition on embodiment tokens, or use a shared end-effector action space with robot-specific low-level controllers.