How to Bridge the Sim-to-Real Gap for Robot Policies

A practical guide to transferring robot manipulation policies from simulation to real hardware. Covers system identification (matching simulation to reality), domain randomization (making the policy robust to variation), real-world fine-tuning (closing the remaining gap with demonstrations), and rigorous evaluation protocols. The goal is a policy that achieves >80% of its simulation performance on real hardware.

Difficulty: Advanced
Time: 4-8 weeks

Prerequisites

  • Policy trained in simulation with reasonable success rate (>70%)
  • Access to the target robot hardware
  • Simulation environment with matching task setup
  • Teleoperation interface for real-world data collection

Step 1: Measure the Baseline Gap

Before attempting any transfer, quantify the sim-to-real gap by deploying the simulation-trained policy directly on real hardware. Run 50-100 episodes of the same tasks used in simulation. Record success rate, failure modes, and qualitative behavior differences. This baseline tells you how much work is needed and which failure modes to target.

Common failure patterns and their causes:

  • Policy fails to detect objects: visual gap (different lighting, textures, backgrounds)
  • Policy reaches incorrectly for objects: camera calibration mismatch between sim and real
  • Policy grasps but drops objects: physics gap (real friction differs from simulation)
  • Policy moves jerkily or oscillates: actuator gap (real motors have latency and compliance not modeled in simulation)
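To make the failure analysis concrete, episode outcomes can be tallied by failure mode with a short script. This is a minimal sketch; the episode record format and the mode names are hypothetical, not part of any particular tooling.

```python
# Hypothetical sketch: tally baseline evaluation episodes by failure mode.
# The episode dict schema and mode names are assumptions, not a real API.
from collections import Counter

FAILURE_MODES = ("success", "visual", "calibration", "physics", "actuator")

def summarize(episodes):
    """episodes: list of dicts like {"outcome": "visual"}."""
    counts = Counter(e["outcome"] for e in episodes)
    total = len(episodes)
    report = {mode: counts.get(mode, 0) / total for mode in FAILURE_MODES}
    report["success_rate"] = counts.get("success", 0) / total
    return report

episodes = [{"outcome": "success"}] * 12 + [{"outcome": "visual"}] * 5 + \
           [{"outcome": "physics"}] * 3
print(summarize(episodes)["success_rate"])  # 0.6
```

Sorting the non-success modes by frequency tells you which of the later steps (visual randomization, system identification, or latency modeling) to prioritize.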

Tip: Record video of all real-world evaluation episodes. Failure mode analysis requires watching what actually happened, not just counting successes.

Tip: If baseline success rate is 0%, the gap is too large for fine-tuning alone. Go to system identification (Step 2) first.

Step 2: Improve Simulation Fidelity (System Identification)

Make your simulation match reality as closely as possible before applying domain randomization. Measure real-world parameters and update the simulation:

  • Weigh all objects and set accurate masses
  • Measure surface friction with simple slide tests
  • Calibrate camera intrinsics and extrinsics and use them in simulation
  • Measure actuator latency (command-to-motion delay) and add it to the simulation
  • Verify that the robot's URDF/MJCF model matches the physical robot's dimensions and joint limits
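The slide test for friction has a simple closed form: tilt the support surface until the object just starts to slide, and the static friction coefficient is the tangent of that angle. A minimal sketch:

```python
# Slide test: the static friction coefficient equals tan(slide angle).
import math

def friction_from_slide_angle(angle_deg: float) -> float:
    """Tilt the surface until the object starts sliding; pass that angle."""
    return math.tan(math.radians(angle_deg))

print(round(friction_from_slide_angle(30.0), 3))  # 0.577
```

Repeat the test a few times per object-surface pair and use the mean as the nominal friction value in simulation; the spread gives you a principled randomization range for Step 3.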

The most impactful system identification steps are camera calibration (reducing the visual gap at no data cost) and actuator latency modeling (adding a 1-3 frame delay between action commands and simulated execution). These two changes alone can improve zero-shot transfer success by 20-40%. Use a physics parameter estimation tool (like the one built into Isaac Sim) to automatically fit simulation parameters to real-world recordings.
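Latency modeling can be as simple as buffering commanded actions for a few control frames before they reach the simulated actuators. A minimal sketch, assuming a scalar action and a fixed frame delay; the class and its interface are illustrative, not a simulator API:

```python
# Sketch of actuator latency modeling: delay commanded actions by N frames
# before they are executed, mimicking command-to-motion delay.
from collections import deque

class DelayedActuator:
    def __init__(self, delay_frames=2, neutral_action=0.0):
        # Pre-fill so the first `delay_frames` steps apply a neutral action.
        self.buffer = deque([neutral_action] * delay_frames,
                            maxlen=delay_frames + 1)

    def step(self, commanded_action):
        self.buffer.append(commanded_action)
        return self.buffer.popleft()  # action actually executed this frame

act = DelayedActuator(delay_frames=2)
executed = [act.step(a) for a in [1.0, 2.0, 3.0, 4.0]]
print(executed)  # [0.0, 0.0, 1.0, 2.0]
```

In a real setup the delayed action would be fed to the simulator's step call (e.g. before `mujoco.mj_step`), and the delay length would come from the step-response measurement.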

Tools: MuJoCo, Isaac Sim, OpenCV (camera calibration)

Tip: System identification is the highest-ROI step in sim-to-real transfer. Spend 2-3 days on it before moving to domain randomization.

Tip: Record the robot executing simple motions (waving, reaching) and compare against simulation replays to visually verify alignment.

Tip: Measure actuator latency by commanding step inputs and recording the response delay. Add this delay to the simulation controller.
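One way to implement the step-input measurement from the tip above: command a step at t=0, log joint positions at the control rate, and report the time of the first sample that moves beyond a noise threshold. A sketch, with a hypothetical threshold and sample period:

```python
# Sketch of latency measurement from a step response. `positions` is a
# list of joint positions sampled every `dt` seconds after the step
# command; `threshold` is an assumed noise floor.
def measure_latency(positions, dt=0.02, threshold=0.01):
    for i, p in enumerate(positions):
        if abs(p - positions[0]) > threshold:
            return i * dt  # seconds from command to first detected motion
    return None  # no motion detected

# 3 stationary samples, then motion begins -> 3 frames * 20 ms = 60 ms
print(round(measure_latency([0.0, 0.0, 0.0, 0.05, 0.12]), 3))  # 0.06
```

Dividing the measured delay by the simulation timestep gives the frame count for the delay buffer described earlier in this step.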

Step 3: Apply Domain Randomization in Simulation

Train the policy in simulation with randomized visual and physical parameters.

Visual randomization:

  • Textures: random noise, procedural patterns, or samples from a texture database
  • Lighting: 2-8 point lights with random positions and intensities
  • Camera pose: +/-3 cm translation, +/-5 degrees rotation
  • Background: random objects and varied table textures

Physics randomization:

  • Object mass: +/-20% around measured values
  • Friction: +/-30%
  • Joint damping: +/-20%
  • Random force perturbations applied to the object during grasping

Train with progressive randomization: start with low randomization and increase it as the policy improves. This is more sample-efficient than training with maximum randomization from the start. Monitor simulation success rate — if it drops below 60% with randomization, reduce the ranges until the policy can maintain 70%+ success, then gradually expand.
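The progressive schedule can be sketched as a curriculum factor in [0, 1] that scales the randomization half-widths, expanding while success stays at or above 70% and shrinking below 60%. The nominal values, step size, and schedule function here are illustrative assumptions, not a standard recipe:

```python
# Sketch of progressive domain randomization. Nominal values come from
# system identification; ranges and the 0.1 curriculum step are assumed.
import random

MEASURED = {"mass": 0.35, "friction": 0.8, "damping": 0.05}
MAX_RANGE = {"mass": 0.20, "friction": 0.30, "damping": 0.20}  # +/- fraction

def sample_params(curriculum: float) -> dict:
    """curriculum=0.0 -> no randomization; 1.0 -> full ranges."""
    out = {}
    for name, nominal in MEASURED.items():
        half_width = MAX_RANGE[name] * curriculum
        out[name] = nominal * random.uniform(1 - half_width, 1 + half_width)
    return out

def update_curriculum(curriculum, success_rate):
    # Expand ranges only while the policy holds >=70% success in sim;
    # shrink them if success drops below 60%.
    if success_rate >= 0.70:
        return min(1.0, curriculum + 0.1)
    if success_rate < 0.60:
        return max(0.0, curriculum - 0.1)
    return curriculum

print(round(update_curriculum(0.5, 0.75), 2))  # 0.6
```

Each training epoch samples fresh parameters per environment instance; the curriculum update runs on periodic evaluation results.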

Tools: MuJoCo, Isaac Sim, NVIDIA Warp

Tip: Randomize visual parameters aggressively — the visual gap is usually the largest contributor and visual randomization is computationally free.

Tip: Keep physics randomization conservative and centered on measured values. Unrealistic physics produces unrealistic learned behaviors.

Tip: Monitor simulation success rate during randomized training. If it drops below 60%, reduce randomization ranges.

Step 4: Collect Real-World Fine-Tuning Data

Collect teleoperation demonstrations on the real robot in the target deployment environment. The fine-tuning dataset should match the deployment conditions: same robot, same cameras, same environment (or similar environments). Collect 500-2,000 demonstrations for the target tasks with the same diversity guidelines as the simulation training (varied object positions, approach angles, and initial conditions).

Pay special attention to collecting data for the failure modes identified in Step 1. If the policy failed on transparent objects in simulation, include transparent objects in the fine-tuning data. If it failed on specific lighting conditions, collect under those conditions. Targeted data collection for observed failure modes is more efficient than uniform random collection.

Tools: ROS2, teleoperation interface, recording pipeline

Tip: 500 real demonstrations that target observed failure modes are worth more than 5,000 random demonstrations.

Tip: Use the same recording pipeline and data format as the simulation data to enable seamless combined training.

Tip: Collect under varied lighting conditions that match the deployment environment, not just lab lighting.

Step 5: Fine-Tune the Policy on Real Data

Fine-tune the domain-randomized simulation policy on the real-world demonstrations. Use a learning rate 5-10x lower than the simulation training rate to prevent catastrophic forgetting of simulation-learned skills. Fine-tune all parameters (vision encoder, action head) unless the dataset is very small (<200 demos), in which case freeze the vision encoder.

Two fine-tuning strategies: (1) Real-only fine-tuning: train only on real data. Simple and effective when the sim-to-real gap is moderate. (2) Co-training: mix real data with simulation data (50-50 or 70-30 real-to-sim ratio) to prevent overfitting to the small real dataset while adapting to real-world conditions. Co-training is preferred when the real dataset is small (<500 demos). Monitor validation performance on held-out real episodes — stop when validation loss plateaus or starts increasing.
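Co-training batch construction can be sketched as weighted sampling from the two datasets. A minimal stand-in (a real PyTorch pipeline would use a weighted sampler or concatenated datasets; the lists here are placeholders for real and simulated demonstrations):

```python
# Sketch of co-training batch construction: each batch draws ~70% real
# and ~30% sim samples. Dataset contents and sizes are placeholders.
import random

def cotraining_batch(real_data, sim_data, batch_size=8, real_ratio=0.7):
    n_real = round(batch_size * real_ratio)
    batch = random.choices(real_data, k=n_real) + \
            random.choices(sim_data, k=batch_size - n_real)
    random.shuffle(batch)  # avoid a fixed real/sim ordering within batches
    return batch

real = [("real", i) for i in range(500)]
sim = [("sim", i) for i in range(50_000)]
batch = cotraining_batch(real, sim)
n_real = sum(1 for src, _ in batch if src == "real")
print(len(batch), n_real)  # 8 6
```

Because the real set is sampled with replacement at a fixed ratio, each real demonstration is seen far more often per epoch than each sim demonstration, which is the intended effect of the 70-30 weighting.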

Tools: PyTorch, wandb

Tip: Always hold out 10% of real data for validation. Never tune hyperparameters on the training set.

Tip: If fine-tuning degrades performance, the learning rate is too high or the real dataset has quality issues.

Tip: Monitor per-task success rates during fine-tuning. Some tasks may improve while others regress.

Step 6: Evaluate and Iterate

Deploy the fine-tuned policy on real hardware and evaluate with 100+ episodes per task. Compare against both the simulation-trained baseline (Step 1) and the best simulation performance to quantify how much of the gap was closed. Report: zero-shot success rate (before fine-tuning), fine-tuned success rate, and gap closure percentage.
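The gap closure percentage can be computed as the fraction of the sim-vs-zero-shot gap recovered by fine-tuning. The formula below is one reasonable convention, not a standard metric:

```python
# Sketch of a gap-closure metric: what fraction of the gap between
# simulation performance and zero-shot transfer did fine-tuning recover?
def gap_closure(sim_rate, zero_shot_rate, finetuned_rate):
    gap = sim_rate - zero_shot_rate
    if gap <= 0:
        return 1.0  # zero-shot already matched or beat sim: nothing to close
    return (finetuned_rate - zero_shot_rate) / gap

# e.g. 90% success in sim, 20% zero-shot, 76% after fine-tuning:
print(round(gap_closure(0.90, 0.20, 0.76), 2))  # 0.8
```

Reporting this alongside the raw success rates makes iterations comparable even when tasks have different simulation ceilings.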

If the fine-tuned success rate is below target: analyze failure modes again. Common remaining issues: (1) Visual failures on specific objects or conditions — collect more data targeting these conditions. (2) Physics failures on contact-rich subtasks — increase real data for those subtasks. (3) Generalization failures on novel configurations — increase diversity in the fine-tuning dataset. Plan another collection-finetune iteration targeting the identified gaps. Most production deployments require 2-3 iterations of this cycle.

Tip: Record all deployment episodes for future training data — even successful ones provide valuable real-world examples.

Tip: Track metrics over time to detect performance degradation from environment changes (seasonal lighting, new objects, wear on the robot).

Tip: Plan for 2-3 iterations of the evaluate-collect-finetune cycle for production deployments.

Tools & Technologies

MuJoCo, NVIDIA Isaac Sim, PyTorch, ROS2, wandb

References

  1. Tobin et al. "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." IROS 2017.
  2. Peng et al. "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization." ICRA 2018.
  3. Akkaya et al. "Solving Rubik's Cube with a Robot Hand." arXiv:1910.07113, 2019.

How Claru Can Help

Claru provides the real-world fine-tuning data that closes the gap domain randomization leaves open. We collect targeted demonstrations on your specific robot hardware in environments that match your deployment conditions, ensuring the fine-tuning data addresses the exact visual and physics gaps your policy faces. Our quality pipeline validates every episode before delivery.

Understanding the Three Components of the Sim-to-Real Gap

The sim-to-real gap is not a single problem but three distinct gaps that compound. The visual gap arises because simulation renderers produce images that differ from real cameras in texture detail, lighting characteristics, lens distortion, and noise patterns. Even the best ray-tracing engines (NVIDIA Omniverse, Unreal Engine 5) produce images that a trained classifier can distinguish from real photos with near-perfect accuracy. The physics gap arises because simulators approximate contact dynamics, friction, deformation, and fluid behavior with simplified models. MuJoCo uses soft-contact models that behave differently from real rigid-body collisions, especially during tight-clearance insertions and multi-finger grasping. The dynamics gap arises because real robot actuators have latency, compliance, backlash, and thermal drift that simulators typically ignore.

Each gap component has different remediation strategies. The visual gap is best addressed by domain randomization (varying textures, lighting, camera parameters) and real-world fine-tuning with RGB data. The physics gap requires system identification (measuring real physical parameters and setting them in simulation) combined with physics randomization over a narrow range. The dynamics gap requires either modeling the actuator dynamics explicitly (adding delays and compliance to the simulated controller) or collecting real-world demonstrations that implicitly teach the policy to compensate. Understanding which gap is limiting your transfer is the first diagnostic step — deploy the sim-trained policy on real hardware, record all failures, and classify each failure as visual, physics, or dynamics.

The Role of Real-World Data in Sim-to-Real Transfer

Real-world demonstrations serve three distinct purposes in sim-to-real transfer, and understanding which purpose you need determines how many demonstrations to collect. Purpose 1: visual grounding — teaching the policy what real-world objects, textures, and lighting look like. For this, 50-200 real images (not full trajectories) may suffice, especially when combined with strong visual augmentation during sim training. Purpose 2: dynamics calibration — teaching the policy how real actuators respond, how real objects slide and tumble, and how real contacts feel. For this, 500-2,000 full demonstrations are needed because the policy must experience the real dynamics over many interaction scenarios. Purpose 3: task-specific fine-tuning — adapting the sim-trained policy to the specific objects, tools, and workspace of the deployment environment. For this, 100-500 demonstrations in the exact deployment setting are typically sufficient.

The most cost-effective approach is staged collection: first collect a small visual grounding set (100 images from the deployment environment), evaluate the sim-trained policy to identify the primary failure mode, then collect targeted demonstrations for the identified gap. If the policy detects objects correctly but grasps poorly, collect dynamics calibration data (focus on varied grasping scenarios). If it grasps well in simulation-like conditions but fails under unusual lighting, collect visual grounding data under the problematic lighting conditions. This targeted approach is 3-5x more data-efficient than uniform random collection.

Frequently Asked Questions

How many real-world demonstrations do I need?

For manipulation tasks with moderate visual complexity: 500-2,000 real demonstrations typically recover 80-95% of simulation performance. For contact-rich tasks (insertion, assembly): 2,000-5,000 demonstrations may be needed because the physics gap is larger. For primarily visual tasks (object detection, sorting): 50-200 real images may suffice. Start with 200-500 real demonstrations, evaluate, and collect more if the performance gap remains above 20%.

Should I fine-tune the whole policy or freeze some components?

Fine-tune the entire policy with a reduced learning rate (1/10th of the simulation training rate). Freezing the vision encoder preserves pretrained visual features but prevents adaptation to real-world visual characteristics. Freezing the action head preserves motor skills but prevents adaptation to real actuator dynamics. Full fine-tuning with a small learning rate allows all components to adapt while staying close to the simulation-trained initialization.

Which simulator should I use?

For manipulation: MuJoCo (fast, accurate contact physics, widely supported) or Isaac Sim (GPU-accelerated, photorealistic rendering). For locomotion: Isaac Gym (massive parallelism, fast RL training). For mobile manipulation: Habitat or AI2-THOR (realistic indoor environments). Choose based on your primary task: accurate contact physics matters more for manipulation, while visual fidelity matters more for perception tasks.

What success rates should I expect?

Without any domain randomization or fine-tuning, zero-shot transfer from simulation to real hardware typically achieves a 0-30% success rate for manipulation tasks: the policy works in simulation but fails on real hardware due to visual and physics differences. With domain randomization but no real data, success rates improve to 40-70% depending on task complexity and randomization quality. With domain randomization plus 500-2,000 real-world demonstrations for fine-tuning, success rates typically reach 70-90% of the policy's simulation performance. The OpenAI Dactyl project achieved remarkable zero-shot transfer for in-hand cube rotation by using extensive domain randomization and automatic domain randomization (ADR), but most manipulation tasks require some real-world fine-tuning.

Need Real-World Data for Fine-Tuning?

Claru provides the diverse real-world demonstrations needed to close the sim-to-real gap on your specific hardware and deployment environment.