How to Generate Synthetic Robot Data
A practitioner's guide to generating synthetic training data for robot learning — selecting physics simulators, implementing domain randomization, creating photorealistic assets, validating sim-to-real transfer, and optimally mixing synthetic data with real demonstrations.
Prerequisites
- Physics simulator (Isaac Gym, MuJoCo, or PyBullet) installed and configured
- URDF or MJCF model of your robot
- 3D models of task objects (or procedural generation pipeline)
- GPU for simulation (NVIDIA A100 or RTX 4090 recommended)
- Real-world validation dataset (100+ episodes) for sim-to-real gap measurement
Select and Configure the Physics Simulator
Choose a simulator based on your requirements for speed, accuracy, and rendering quality. Isaac Gym excels at parallelized GPU simulation — it runs 4,096 environments simultaneously on a single A100, generating manipulation episodes at 10,000+ per hour. MuJoCo provides superior contact physics accuracy for tasks involving sliding, rolling, and deformable objects, running at 500-2,000 episodes per hour on CPU. For photorealistic visual data, pair either simulator with a ray-tracing renderer: Isaac Sim (built on NVIDIA Omniverse) produces photorealistic images at 5-15 FPS per environment.
Import your robot's URDF model into the simulator and verify joint limits, link geometries, and collision meshes match the real robot. Run a simple position-control test: command the simulated robot to 20 joint configurations and compare the resulting end-effector positions against forward kinematics computed from the URDF — they should match within 0.1mm. Import task object meshes (from 3D scanning, CAD models, or procedural generation) and set material properties (friction, mass, restitution) to approximate real-world values. If exact material properties are unknown, use conservative estimates and randomize them during data generation.
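The FK check above can be sketched as a small helper. Here `fk_fn` is a hypothetical forward-kinematics callable you would supply (for example, from a kinematics library that parses your URDF); the toy two-link planar arm below only stands in for it so the sketch is self-contained:

```python
import numpy as np

def verify_fk_match(commanded_configs, sim_ee_positions, fk_fn, tol_m=1e-4):
    """Compare simulator end-effector positions against independent FK.

    commanded_configs: (N, n_joints) joint angles sent to the simulator.
    sim_ee_positions:  (N, 3) end-effector positions read back from the sim.
    fk_fn: forward-kinematics function mapping a joint config to (x, y, z),
           computed independently from the URDF (hypothetical callable).
    tol_m: tolerance in metres (0.1 mm by default, as in the text).
    Returns indices of configurations whose error exceeds the tolerance.
    """
    expected = np.array([fk_fn(q) for q in commanded_configs])
    errors = np.linalg.norm(np.asarray(sim_ee_positions) - expected, axis=1)
    return np.flatnonzero(errors > tol_m), errors

# Toy 2-link planar arm standing in for the real FK model:
def toy_fk(q, l1=0.3, l2=0.2):
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y, 0.0])

configs = np.random.default_rng(0).uniform(-np.pi, np.pi, size=(20, 2))
sim_pos = np.array([toy_fk(q) for q in configs])  # pretend the sim agrees
bad, errs = verify_fk_match(configs, sim_pos, toy_fk)
```

In practice `sim_ee_positions` would come from the simulator's link-state query after commanding each configuration, and any index returned in `bad` points at a joint limit, link offset, or mesh origin worth re-checking in the URDF.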
Configure the simulation timestep based on your task's dynamics: 1ms (1000 Hz) for fast contact events, 2-5ms (200-500 Hz) for standard manipulation. Lower timesteps are more accurate but slower. Set the rendering camera to match the real camera's intrinsic parameters (focal length, resolution, distortion) so simulated images are geometrically consistent with real images.
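Matching intrinsics usually means converting the calibrated focal length into the field-of-view parameter most simulator camera configs expect. A minimal sketch of that standard conversion (the `fy=615.0` value below is an illustrative RealSense-like number, not from the text):

```python
import math

def intrinsics_to_vertical_fov(fy, height_px):
    """Convert a calibrated focal length to vertical field of view.

    fy: focal length in pixels (from the camera matrix).
    height_px: image height in pixels.
    Returns the vertical FOV in degrees, the form simulator camera
    configs (e.g. a MuJoCo camera's fovy) typically take.
    """
    return math.degrees(2.0 * math.atan(height_px / (2.0 * fy)))

fov = intrinsics_to_vertical_fov(fy=615.0, height_px=480)
```

Set the simulated camera's resolution to match the real sensor exactly, and apply the real lens distortion as a post-processing step if your renderer only produces ideal pinhole images.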
Tip: Run a sim-to-real physics validation before generating data at scale: drop a real object and a simulated object from the same height and compare bounce trajectories — if they differ significantly, adjust the simulator's contact parameters until they match.
Implement Domain Randomization
Domain randomization varies visual and physical parameters across episodes so the trained policy learns to be invariant to these factors. Implement randomization for: (1) Visual parameters — camera position (jitter by 2-5cm and 5-10 degrees from nominal), lighting direction (sample from hemisphere above workspace), lighting intensity (0.5-2x nominal), lighting color temperature (3000-7000K), object textures (sample from a library of 50+ textures), table/background textures (sample from 20+ backgrounds), and image post-processing (brightness, contrast, blur, noise). (2) Physical parameters — object mass (0.5-2x nominal), friction coefficients (0.3-1.0 for typical objects), object dimensions (0.9-1.1x nominal for shape variation), robot joint damping (0.8-1.2x nominal), and control delay (0-20ms to simulate real controller latency).
Implement randomization as a configuration file that specifies the distribution (uniform, Gaussian, categorical) and range for each parameter. At the start of each episode, sample all parameters from their distributions. Log the sampled parameters alongside the episode data so you can analyze which parameter ranges cause the most policy failures.
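A minimal sketch of such a config-plus-sampler, with ranges taken from the parameters listed above (the names and nominal values are illustrative, and a real pipeline would load this from a YAML/JSON file rather than a Python dict):

```python
import random

RANDOMIZATION_CONFIG = {
    # name: (distribution, *params) -- ranges follow the text above.
    "camera_jitter_cm":    ("uniform", 2.0, 5.0),
    "light_intensity":     ("uniform", 0.5, 2.0),
    "light_temp_kelvin":   ("uniform", 3000.0, 7000.0),
    "object_mass_scale":   ("uniform", 0.5, 2.0),
    "friction":            ("uniform", 0.3, 1.0),
    "object_scale":        ("uniform", 0.9, 1.1),
    "joint_damping_scale": ("gaussian", 1.0, 0.1),  # clipped to mean +/- 2 sigma
    "control_delay_ms":    ("uniform", 0.0, 20.0),
    "object_texture":      ("categorical", [f"tex_{i:03d}" for i in range(50)]),
}

def sample_episode_params(config, rng=random):
    """Sample one value per parameter at the start of each episode."""
    params = {}
    for name, spec in config.items():
        kind = spec[0]
        if kind == "uniform":
            params[name] = rng.uniform(spec[1], spec[2])
        elif kind == "gaussian":
            lo, hi = spec[1] - 2 * spec[2], spec[1] + 2 * spec[2]
            params[name] = min(max(rng.gauss(spec[1], spec[2]), lo), hi)
        elif kind == "categorical":
            params[name] = rng.choice(spec[1])
    return params  # log this dict alongside the episode data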
Use progressive domain randomization during training: start with low randomization (close to real-world nominal values) and gradually increase the randomization range over training epochs. This curriculum approach produces more stable training than starting with maximum randomization, because the policy first learns the task in easy conditions and then generalizes to harder variations.
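One way to implement the curriculum is to scale each parameter's range toward its full width over training; a sketch, assuming a simple linear widening schedule:

```python
def randomization_scale(epoch, total_epochs, start=0.1, end=1.0):
    """Fraction of the full randomization range to use at this epoch
    (linear schedule from `start` to `end`)."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + t * (end - start)

def scaled_range(nominal, full_lo, full_hi, scale):
    """Shrink the full [full_lo, full_hi] range toward the nominal
    value by the current curriculum scale."""
    return (nominal - scale * (nominal - full_lo),
            nominal + scale * (full_hi - nominal))
```

For example, a friction range of (0.3, 1.0) around a nominal 0.7 stays narrow at the start of training and only reaches its full width once `randomization_scale` approaches 1.0.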
Tip: Generate a 'randomization impact report' by training policies with each parameter randomized individually and measuring the sim-to-real gap — this identifies which parameters matter most and which can be left at nominal values, reducing unnecessary randomization that adds noise.
Generate Episodes at Scale
Design a data generation pipeline that runs on a GPU cluster and produces episodes at maximum throughput. For Isaac Gym, use the parallel environment API: create 1,000-4,096 parallel environments on a single A100, run scripted or policy-based agents in each environment, and collect episodes at 2,000-10,000 per hour.
For scripted demonstrations (no trained policy needed), implement expert heuristics for your task: compute the grasp pose from the object's known pose, plan a trajectory from the current position to the grasp pose using RRT or linear interpolation, execute the grasp, and transport to the target. Scripted demonstrations are fast to generate but limited in diversity — the heuristic always uses the same strategy. For more diverse demonstrations, use a trained RL policy to generate episodes: train a task-specific RL policy in simulation (PPO with 1-4 hours of training), then use this policy to generate demonstration data. The RL policy naturally explores different strategies, producing more diverse demonstrations than scripted heuristics.
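A scripted pick heuristic along these lines can be sketched with straight-line interpolation (positions only; gripper open/close commands are omitted, and you would substitute an RRT planner where clutter makes straight lines unsafe):

```python
import numpy as np

def scripted_pick(object_pos, target_pos, n_waypoints=50, approach_height=0.10):
    """Minimal scripted pick-and-place trajectory via linear interpolation.

    object_pos, target_pos: (x, y, z) positions; full 6-DoF grasp poses
    and gripper commands are left out for brevity.
    Returns an (3 * n_waypoints, 3) array of Cartesian waypoints.
    """
    object_pos = np.asarray(object_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    # Approach from above, descend to grasp, lift back up, then transport.
    above = object_pos + np.array([0.0, 0.0, approach_height])

    def lerp(a, b):
        return [a + t * (b - a) for t in np.linspace(0.0, 1.0, n_waypoints)]

    waypoints = (lerp(above, object_pos)      # descend to the grasp
                 + lerp(object_pos, above)    # (close gripper) and lift
                 + lerp(above, target_pos))   # transport to the target
    return np.array(waypoints)
```

The fixed structure of this heuristic is exactly the diversity limitation the text describes: every episode approaches from directly above, which is why RL-generated demonstrations add strategy variety that scripts cannot.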
Structure the pipeline as a producer-consumer system: producer processes run simulation environments and generate episodes in memory, consumer processes write episodes to disk in the target format (RLDS, HDF5). Use a shared queue between producers and consumers to decouple simulation speed from I/O speed. Implement checkpointing: save the pipeline state every 1,000 episodes so a crashed job can resume without re-generating data.
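The producer-consumer structure can be sketched as follows; threads and an in-memory sink stand in for simulation processes and HDF5/RLDS writers so the example stays self-contained:

```python
import json
import queue
import threading

def producer(env_id, out_q, n_episodes):
    """Stand-in for a simulation worker generating episodes in memory."""
    for i in range(n_episodes):
        episode = {"env": env_id, "episode": i, "steps": 100}  # placeholder payload
        out_q.put(episode)
    out_q.put(None)  # sentinel: this producer is done

def consumer(out_q, n_producers, sink, checkpoint_every=1000):
    """Drains the queue and writes episodes; counts sentinels to know when to stop."""
    done, written = 0, 0
    while done < n_producers:
        item = out_q.get()
        if item is None:
            done += 1
            continue
        sink.append(json.dumps(item))  # in practice: write HDF5/RLDS shards
        written += 1
        if written % checkpoint_every == 0:
            pass  # save pipeline state here so a crashed job can resume

q = queue.Queue(maxsize=256)  # bounded queue decouples sim speed from I/O speed
sink = []
producers = [threading.Thread(target=producer, args=(i, q, 50)) for i in range(4)]
writer = threading.Thread(target=consumer, args=(q, 4, sink))
for t in producers:
    t.start()
writer.start()
for t in producers:
    t.join()
writer.join()
```

The bounded queue is the important design choice: when disk I/O stalls, producers block instead of exhausting memory, and when simulation stalls, the writer simply waits.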
Quality filter generated episodes before adding them to the dataset: discard episodes where the task was not completed (the scripted or RL agent failed), where physical anomalies occurred (objects clipping through surfaces, unrealistic velocities), or where domain-randomized parameters produced unrealistic configurations (object floating in mid-air due to extreme mass randomization).
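A sketch of such a filter, assuming episodes are dicts with illustrative field names; `MAX_SPEED_M_S` is an assumed plausibility bound for a tabletop arm, not a value from the text:

```python
import numpy as np

MAX_SPEED_M_S = 5.0    # assumed plausibility bound; tune per task
TABLE_HEIGHT_M = 0.0   # z-coordinate of the support surface

def episode_passes_filter(episode):
    """episode: dict with 'success' (bool), 'object_positions' (T, 3),
    and 'object_velocities' (T, 3). Field names are illustrative."""
    if not episode["success"]:
        return False                                   # task not completed
    pos = np.asarray(episode["object_positions"])
    vel = np.asarray(episode["object_velocities"])
    if (np.linalg.norm(vel, axis=1) > MAX_SPEED_M_S).any():
        return False                                   # unrealistic velocities
    if (pos[:, 2] < TABLE_HEIGHT_M - 0.01).any():
        return False                                   # clipped through the table
    return True
```

Run the filter in the consumer stage, before episodes reach disk, and log the rejection reason so you can spot a randomization range that is systematically producing broken physics.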
Tip: Profile the pipeline to identify bottlenecks — for Isaac Gym, the bottleneck is usually rendering (if enabled) not physics, so consider generating physics-only episodes for proprioceptive policies and adding rendering only for vision-based policies.
Validate Sim-to-Real Transfer Quality
Before using synthetic data for training, measure the sim-to-real gap on your specific task. Train two policies: one on 100% real data and one on 100% synthetic data. Evaluate both on the same set of 50+ real-world trials. The difference in success rates is the sim-to-real gap.
Analyze the gap by failure mode: if the synthetic-trained policy fails primarily on grasp execution (contact-related), the gap is in physics fidelity — improve contact parameters or add more real data for the contact phase. If it fails primarily on object localization (perception-related), the gap is in visual rendering — improve visual domain randomization or use more photorealistic rendering.
Compute distribution statistics: compare the distribution of trajectories (end-effector positions over time) between synthetic and real datasets. Use Maximum Mean Discrepancy (MMD) or Fréchet distance to quantify how similar the distributions are. High distributional distance indicates systematic differences between simulated and real robot behavior — investigate and correct the simulation parameters.
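A minimal (biased) RBF-kernel estimate of squared MMD over trajectory feature vectors might look like this; the Gaussian toy data at the bottom only illustrates that a larger systematic offset between datasets yields a larger distance:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD with an RBF kernel.

    X, Y: (n, d) and (m, d) arrays of trajectory feature vectors,
    e.g. flattened end-effector paths. `sigma` is the kernel bandwidth;
    the median pairwise distance is a common heuristic choice.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n, m = len(X), len(Y)
    return (k(X, X).sum() / n**2
            + k(Y, Y).sum() / m**2
            - 2.0 * k(X, Y).sum() / (n * m))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 8))
sim_close = rng.normal(0.05, 1.0, size=(200, 8))  # small systematic offset
sim_far = rng.normal(2.0, 1.0, size=(200, 8))     # large systematic offset
```

Track this number as you tune simulation parameters: it should fall as the simulated trajectory distribution moves toward the real one, and the same feature vectors can feed a Fréchet-distance computation if you prefer that metric.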
Run an ablation on mixing ratios: train policies with synthetic:real ratios of 100:0, 90:10, 80:20, 70:30, 50:50, 30:70, 0:100. Evaluate each on real-world trials. The optimal ratio is typically 60-80% synthetic, 20-40% real, but varies by task. Plot success rate vs. mixing ratio to find the sweet spot for your task.
Tip: The sim-to-real gap is not fixed — it changes as you improve the simulator, add domain randomization, or change the policy architecture. Re-measure the gap whenever you make significant changes to the simulation or training pipeline.
Mix Synthetic and Real Data for Training
Combine synthetic and real data using a weighted sampling strategy during training. The simplest approach is uniform mixing: at each training step, sample a batch where X% of examples come from synthetic data and (100-X)% from real data, where X is your optimal mixing ratio from the ablation study.
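Uniform mixing reduces to a few lines; a sketch, where `synthetic_frac=0.7` stands in for whatever ratio your ablation study selects:

```python
import random

def sample_mixed_batch(synthetic, real, batch_size, synthetic_frac=0.7, rng=random):
    """Draw one training batch with a fixed synthetic:real ratio.

    synthetic, real: lists (or any indexable datasets) of training examples.
    synthetic_frac: share of the batch drawn from synthetic data.
    """
    n_syn = round(batch_size * synthetic_frac)
    batch = [rng.choice(synthetic) for _ in range(n_syn)]
    batch += [rng.choice(real) for _ in range(batch_size - n_syn)]
    rng.shuffle(batch)  # avoid synthetic-first ordering within the batch
    return batch
```

In a real training loop you would sample indices into memory-mapped datasets rather than Python lists, but the ratio logic is the same.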
A more sophisticated approach is curriculum mixing: start training with 90% synthetic data (to learn the coarse task structure) and gradually increase the share of real data toward your optimal ratio from the ablation study (e.g. 30-40% real) over the course of training (to fine-tune on real-world physics and visuals). This curriculum produces more stable training because the policy does not encounter the hard real-world distribution until it has learned the basic task from easier synthetic examples.
Implement data source tagging: every training example includes a binary flag indicating synthetic or real origin. Some policy architectures can use this flag as a conditioning signal, learning different action distributions for synthetic and real observations. This prevents the policy from being confused by systematic differences between synthetic and real images.
Monitor per-source metrics during training: track the loss separately for synthetic and real examples. If the real data loss stops decreasing while the synthetic data loss continues to drop, the model is overfitting to synthetic data characteristics — increase the real data proportion or add regularization. If both losses decrease consistently, the mixing ratio is well-calibrated.
After training, evaluate exclusively on real-world trials. Never evaluate on synthetic data — the results are not representative of real-world performance. Compare the mixed-data policy against the real-data-only baseline to quantify the value added by synthetic data.
Tip: Keep synthetic and real datasets in separate storage directories with clear naming — accidentally contaminating a real dataset with synthetic episodes (or vice versa) creates subtle training distribution shifts that are extremely difficult to diagnose.
How Claru Can Help
Claru generates synthetic data using Isaac Gym and Isaac Sim with calibrated domain randomization pipelines validated against real-world data. We provide mixed synthetic+real datasets optimized for your specific sim-to-real gap, with ablation results showing the optimal mixing ratio. Our simulation assets are calibrated against real object scans and material measurements for minimal sim-to-real gap.
Why Synthetic Data Generation Matters for Robot Learning
Synthetic data generated in physics simulators can supplement real-world demonstrations at hundreds of times the collection speed and near-zero marginal cost. Isaac Gym generates 10,000 simulated manipulation episodes per hour on a single A100 GPU, compared to 20-40 real-world episodes per hour per operator. However, synthetic data has a fundamental limitation: the simulation is not reality. Contact physics in MuJoCo and Isaac Gym approximates but does not replicate real-world friction, deformation, and surface interactions. Visual rendering, even with ray-tracing, lacks the photographic complexity of real camera images. The gap between synthetic and real data — the sim-to-real gap — determines how much value synthetic data adds to training.
The optimal strategy is not to replace real data with synthetic data, but to use synthetic data strategically: for exploration of the policy's action space (generating diverse trajectories that a human operator would not think to demonstrate), for data augmentation (rendering the same trajectory under varied visual conditions), and for negative example generation (producing collisions and failures that are dangerous to collect in the real world). Research from NVIDIA showed that mixing 70% synthetic data with 30% real data outperformed training on either data source alone, with the synthetic data providing diversity and the real data providing physical grounding.
Frequently Asked Questions
Which physics simulator should I use for synthetic data generation?
Isaac Gym (NVIDIA) for GPU-parallelized simulation of rigid and deformable bodies — it is the fastest option for large-scale data generation, running 4,000+ environments in parallel on a single A100. MuJoCo for accurate contact physics and research reproducibility — it is the standard in the manipulation research community and supports complex contact scenarios. PyBullet for a free, easy-to-setup option suitable for prototyping. For photorealistic rendering, use Isaac Sim (built on Omniverse) or use MuJoCo for physics and pipe the states to a separate renderer (Blender, Unity).
Which parameters should I randomize, and how aggressively?
Randomize at minimum: camera pose (5-10 degree jitter), lighting (direction, intensity, color temperature), object texture, table texture, and object position. For each parameter, define a uniform or Gaussian distribution centered on the real-world nominal value with a range that covers expected deployment variation. Start with moderate randomization and increase until validation performance on real-world data stops improving. Over-randomization can hurt: if you randomize friction coefficients too aggressively, the simulated physics no longer resembles reality and the data becomes noise.
Can synthetic data fully replace real-world demonstrations?
Not yet for most manipulation tasks. Synthetic data excels for navigation, grasping pre-shapes, and coarse reaching motions where contact physics accuracy is less critical. For contact-rich manipulation (insertion, assembly, tool use), real-world data is still essential because the sim-to-real gap in contact physics is large. The current best practice is to use synthetic data for the majority of training data (60-80%) and fine-tune with real-world data (20-40%). Teams that attempt 100% synthetic training typically see a 15-30% performance drop compared to this mixed approach.
Need Help with Synthetic Data Generation?
Claru provides expert services for synthetic robot data generation. Contact us to discuss your specific requirements and get a custom data collection plan.