Last updated: April 2026

The Sim-to-Real Gap Explained: Why It Happens and How to Close It (2026)

A policy that achieves 95% task success in Isaac Gym can fail completely when deployed on real hardware. This post explains exactly why — and what specific data is needed to address each cause.

TL;DR

  • The sim-to-real gap has four distinct causes: visual domain gap, physics approximation error, sensor noise mismatch, and long-tail scenario absence — each requires a different mitigation strategy.
  • Domain randomization (DR) addresses the visual domain gap but cannot fix physics approximation error for contact-rich tasks or generate the long-tail real-world scenarios robots encounter in deployment.
  • Each cause maps to a specific data requirement: real deployment-environment images (visual gap), real robot interaction logs (physics gap), deployment-hardware sensor data (noise gap), and diverse real-world footage (long-tail gap).
  • Claru's 500K+ real-world egocentric clips across 100+ cities and diverse environments directly address the long-tail visual scenario gap that simulation cannot generate.

What Is the Sim-to-Real Gap?

The sim-to-real gap is the performance degradation that occurs when a robot policy trained entirely (or primarily) in simulation is deployed on physical hardware. It is one of the central unsolved problems in robotics research, and it gets more severe as tasks become more complex.

Simulation offers compelling advantages for robot training: effectively unlimited data, no hardware wear, parallelizable compute, safe exploration of failure modes, and the ability to reset environments instantly. Isaac Gym and its successor Isaac Lab from NVIDIA can run thousands of parallel simulation environments simultaneously, compressing reinforcement learning training that would take years of real-world experience into hours. MuJoCo, PyBullet, and Webots are similarly used across manipulation research.

The problem is that simulation is a model, and every model has approximation errors. When a policy trained in simulation encounters the real world, it faces inputs it was never trained on — real physics, real lighting, real sensor noise, and real situations that the simulation did not generate. The policy was trained to be optimal for the simulated world, not for the real one.

The gap is not one thing. It is four distinct phenomena with different causes and different mitigations. Treating it as a single problem is why many sim-to-real transfer attempts fail: a team applies domain randomization to address the visual gap, sees no improvement, and concludes that sim-to-real doesn't work — when in reality they were ignoring the physics approximation or sensor noise gaps that were the actual failure modes.

Cause 1: Visual Domain Gap

The visual domain gap is the most frequently cited and best-studied cause of sim-to-real failure. Simulated rendering engines — including Unreal Engine, Unity, and the physics-coupled renderers in Isaac Lab — produce images that are visually distinguishable from real camera output by even casual inspection.

The specific differences include:

  • Lighting models: Simulation uses idealized global illumination; real environments have complex indirect lighting, ambient occlusion, and time-varying natural light.
  • Texture and material properties: Simulated textures are static maps; real surfaces have scratches, stains, wear patterns, and view-dependent reflectance.
  • Depth cues: Real depth cameras have characteristic noise patterns (structured light speckle, time-of-flight multi-path interference) absent from simulated depth maps.
  • Camera artifacts: Lens distortion, chromatic aberration, and motion blur are absent from most simulation renderers.

Domain randomization is the primary technique for addressing the visual gap. Introduced by OpenAI in 2017 and refined substantially since, DR trains policies across randomized textures, lighting parameters, camera positions, and object colors — forcing the model to learn features that are invariant to these variations. OpenAI's Dactyl results showed DR enabling sim-to-real transfer for dexterous in-hand manipulation without any real-world training data.

However, DR adds a different type of noise: the policy must be invariant to variations that are physically unrealistic. A table in Dactyl's training had randomized textures including checkerboards and gradients that no real table would exhibit. This unrealistic variation can hurt performance in the actual deployment environment. The practical approach in 2026 is to combine domain randomization with real-world image data from the deployment environment — using DR to avoid overfitting to simulation while using real images to anchor the distribution.
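As a concrete (and deliberately simplified) sketch, per-episode visual domain randomization amounts to sampling one configuration from hand-chosen parameter ranges at each environment reset. The parameter names and ranges below are illustrative only, not tied to any specific simulator's API:

```python
import random

# Illustrative visual randomization ranges (not any simulator's real API).
VISUAL_RANGES = {
    "light_intensity": (0.2, 2.0),   # scale on nominal lighting
    "light_hue_shift": (-0.1, 0.1),  # fractional hue rotation
    "texture_id": (0, 499),          # index into a texture bank
    "camera_fov_deg": (55.0, 75.0),  # field-of-view perturbation
    "camera_jitter_m": (0.0, 0.03),  # positional noise on camera mount
}

def sample_visual_domain(rng: random.Random) -> dict:
    """Draw one randomized visual configuration per training episode."""
    cfg = {}
    for name, (lo, hi) in VISUAL_RANGES.items():
        if isinstance(lo, int):
            cfg[name] = rng.randint(lo, hi)   # discrete choice (texture bank)
        else:
            cfg[name] = rng.uniform(lo, hi)   # continuous parameter
    return cfg

rng = random.Random(0)
episode_cfg = sample_visual_domain(rng)
print(episode_cfg)
```

A training loop would apply one such configuration per episode, so the policy only ever sees features that survive all of these perturbations.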

Cause 2: Physics Approximation Error

Physics simulation approximates reality at the level of contact mechanics, friction, and deformable body dynamics. For rigid-body manipulation in constrained environments (e.g., pick-and-place with simple geometry), the approximations are reasonable. For contact-rich manipulation — assembly, folding, pouring, cutting — they diverge significantly.

MuJoCo (Multi-Joint dynamics with Contact) uses a convex optimization approach to contact forces that produces stable simulation but does not match real contact dynamics for soft materials. Isaac Gym's PhysX engine handles rigid bodies well but has well-documented limitations for deformable objects and slip-contact interactions. Neither simulator accurately models the behavior of a compliant parallel gripper grasping a soft object like a foam cup or a piece of fruit.

Domain randomization over physics parameters (mass, friction coefficients, joint damping) partially compensates for this by forcing policies to handle a range of dynamics rather than a single simulated point estimate. But there is a systematic limit: if the real contact physics falls outside the randomization range — which it often does for novel object geometries or materials — the policy trained with randomized physics will still fail.
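The coverage limit above can be made concrete: physics randomization samples from fixed ranges, and a real-world parameter value outside those ranges is simply never seen during training, no matter how many samples are drawn. The ranges and names below are illustrative:

```python
import random

# Illustrative physics randomization ranges (hypothetical values,
# not tied to any particular simulator or robot).
PHYSICS_RANGES = {
    "object_mass_kg": (0.05, 0.5),
    "friction_coeff": (0.4, 1.0),
    "joint_damping": (0.01, 0.1),
}

def sample_physics(rng: random.Random) -> dict:
    """Draw one randomized dynamics configuration per episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PHYSICS_RANGES.items()}

def covered(real_value: float, key: str) -> bool:
    """Can randomized training ever expose the policy to this real value?"""
    lo, hi = PHYSICS_RANGES[key]
    return lo <= real_value <= hi

# A real material whose effective friction falls below the sampled range
# is a systematic gap no amount of sampling can close.
print(covered(0.7, "friction_coeff"))   # True: inside the range
print(covered(0.15, "friction_coeff"))  # False: outside, never trained on
```

This is why widening the ranges is not a free fix: each widening forces the policy to handle more dynamics, which trades nominal-task performance for coverage that may still miss the real value.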

The only reliable fix for physics approximation error is real interaction data: robot demonstrations or teleoperation logs that capture actual contact dynamics, including failure cases and recovery behaviors. This is why even simulation-heavy pipelines like GR00T N1's training include real robot demonstration data — the simulation provides breadth and scale, while the real data provides physics accuracy.

Cause 3: Sensor Noise Mismatch

Real sensors have characteristic noise profiles that are distinct from the Gaussian or uniform noise models commonly used in simulation. When a policy is trained on simulated sensor data with simplistic noise models and then deployed with real hardware, it encounters systematic signal patterns it was not trained to handle.

Specific examples:

  • Structured-light depth cameras (Intel RealSense): The speckle noise pattern of structured-light projection creates characteristic depth estimation errors at edges and on reflective surfaces that simple additive Gaussian noise does not model.
  • Proprioceptive sensors: Real joint encoders have backlash, quantization noise, and drift that accumulate over manipulation sequences. Simulated proprioception is typically noiseless or idealized.
  • Tactile sensors: Force-torque sensors and tactile arrays (like the ones on Shadow Hand or DIGIT fingertips) have complex noise characteristics that are not meaningfully capturable in simulation without hardware-specific data.

The mitigation is direct: collect data with the exact hardware that will be used in deployment. A policy trained on data from one force-torque sensor model does not automatically transfer to a different one (for example, a Robotiq FT 300), even though both measure end-effector forces and torques. Hardware-specific sensor calibration data is the only reliable way to close this gap.
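To make the mismatch concrete, here is a sketch contrasting a simplistic additive-Gaussian depth noise model with a slightly more realistic structured-light model: range-dependent axial noise plus dropout at depth discontinuities. All constants are illustrative, not calibrated to any specific camera:

```python
import numpy as np

def naive_depth_noise(depth, rng, sigma=0.01):
    """Simplistic additive Gaussian noise, as in many simulators."""
    return depth + rng.normal(0.0, sigma, depth.shape)

def structured_light_noise(depth, rng, sigma0=0.002, k=0.0025, edge_drop=0.3):
    """Sketch of a more realistic structured-light model (illustrative
    constants): axial noise growing with range, plus edge dropout."""
    # Axial noise grows roughly quadratically with range for structured light.
    sigma = sigma0 + k * depth ** 2
    noisy = depth + rng.normal(0.0, 1.0, depth.shape) * sigma
    # Drop returns near strong depth discontinuities (edge speckle failure),
    # writing the sensor's invalid-return code (0.0) instead of a depth value.
    gy, gx = np.gradient(depth)
    edges = np.hypot(gx, gy) > 0.05
    drop = edges & (rng.random(depth.shape) < edge_drop)
    noisy[drop] = 0.0
    return noisy

rng = np.random.default_rng(0)
depth = np.full((64, 64), 1.5)
depth[:, 32:] = 2.5  # a step edge between two surfaces
out = structured_light_noise(depth, rng)
print("invalid pixels:", int((out == 0.0).sum()))
```

A policy trained only on the naive model never sees invalid returns or range-dependent noise, so the first real edge it encounters is out-of-distribution input.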

Cause 4: Long-Tail Scenario Absence

Simulation generates scenarios from a designed distribution. Real environments contain everything that distribution did not anticipate: a cup on a tilted surface, a person walking past during manipulation, an object that looks like a training object but is made of a different material, a workspace that is partially occluded by something that was not there during collection.

This long-tail failure mode is arguably the hardest to address through simulation alone, because it requires anticipating the unanticipated. Procedural generation can add surface clutter and varied object positions, but it cannot generate the genuinely unexpected. The more diverse the deployment environment — kitchen versus lab versus warehouse versus outdoor construction site — the larger this gap becomes.

The implication for data: long-tail coverage requires real-world data collected at scale across diverse environments and conditions. Large-scale egocentric video datasets provide exactly this — they capture the visual diversity of real human manipulation across hundreds of location types, lighting conditions, object configurations, and task contexts that simulation cannot enumerate.

Claru's 500K+ egocentric clips, captured by 10,000+ contributors across 100+ cities on 6 continents, cover the visual long-tail of real manipulation environments. Each clip captures a genuine human in a genuine environment performing genuine tasks — the kind of distributional diversity that simulation cannot procedurally generate. For robotics teams training models that will deploy into unstructured real-world environments, this egocentric data directly addresses the long-tail scenario gap that domain randomization and simulation cannot.

Causes and Mitigations: Summary Table

| Cause | DR Helps? | Real Data Required |
| --- | --- | --- |
| Visual domain gap | Yes — texture/lighting randomization reduces this | Real-world images or video from the target deployment environment |
| Physics approximation error | Partially — mass/friction randomization helps but can't fix systematic contact model errors | Real robot interaction data showing actual contact dynamics and recovery behaviors |
| Sensor noise mismatch | Partially — additive Gaussian noise is a poor model for depth speckle or IMU drift | Data collected with the exact deployment sensor hardware, not simulated equivalents |
| Long-tail scenario absence | No — procedural generation cannot enumerate unknown unknowns | Diverse real-world footage across many environments and unexpected configurations |

Why Domain Randomization Is Not Enough

Domain randomization achieved a real milestone with Dactyl — successfully transferring a dexterous in-hand manipulation policy from MuJoCo simulation to a physical Shadow Dexterous Hand with no real-world training data. This result is frequently cited as proof that simulation is sufficient for complex manipulation.

It is important to understand what Dactyl actually demonstrated. The task (reorienting a rigid block to a target pose, extended to a Rubik's cube in the 2019 follow-up) is a continuous control problem over rigid objects with relatively predictable contact mechanics. It does not involve deformable objects, novel materials, environmental clutter, or human interaction. Domain randomization can close the sim-to-real gap for tasks with bounded visual and physical variation — and Dactyl is that type of task.

For tasks that do involve deformable objects, diverse environments, or real-world clutter, DR alone is insufficient. Results from Physical Intelligence and others on deformable object manipulation (laundry, food packaging, cable management) all required real robot demonstration data precisely because DR could not handle the contact physics divergence for these materials.

The emerging consensus in 2026 is a hybrid approach: use simulation + DR for maximum data volume and safe exploration of failure modes, then supplement with real-world data to close the residual gaps. The ratio depends on task complexity — for rigid pick-and-place, sim-to-real with DR can work with minimal real data; for dexterous manipulation of deformable objects in diverse environments, the real data component is substantial.
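One minimal way to express that task-dependent ratio is a per-batch sampling schedule over sim and real data sources. The fractions below are illustrative placeholders, not tuned recommendations:

```python
import random

# Hypothetical per-task real-data fractions for batch sampling.
# Values are placeholders illustrating the qualitative trend only.
REAL_FRACTION = {
    "rigid_pick_place": 0.05,   # sim + DR does most of the work
    "deformable_manip": 0.50,   # real contact data carries the physics
}

def sample_source(task: str, rng: random.Random) -> str:
    """Decide whether the next training batch comes from sim or real data."""
    return "real" if rng.random() < REAL_FRACTION[task] else "sim"

rng = random.Random(0)
counts = {"real": 0, "sim": 0}
for _ in range(1000):
    counts[sample_source("deformable_manip", rng)] += 1
print(counts)  # roughly half real, half sim for the deformable task
```

In practice the schedule is often curriculum-shaped (sim-heavy early, real-heavy late) rather than a fixed fraction, but the budgeting question is the same.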

The Role of Real-World Data

Different causes of the sim-to-real gap require different types of real-world data. Understanding this mapping helps teams prioritize collection efforts rather than collecting large amounts of data that address the wrong gap.

Visual gap → Real-world images and video from deployment environments

Images and video captured from the actual deployment spaces (a specific kitchen, warehouse, or laboratory) with the same cameras used on the robot. Even a few hundred frames of the real environment can significantly improve visual transfer.

Physics gap → Real robot interaction demonstrations

Teleoperation demonstrations or kinesthetic teaching of the target task with the deployment robot hardware. These captures encode actual contact dynamics and are not substitutable with simulation data for contact-rich tasks.

Sensor noise gap → Hardware-calibrated sensor data

Recordings from the exact sensor hardware used in deployment, across the range of conditions the robot will encounter. This is the most hardware-specific data requirement and cannot be generalized across sensor models.

Long-tail gap → Diverse real-world egocentric video at scale

Large-scale first-person video from many environments, lighting conditions, object configurations, and task contexts. This is precisely what Claru's 500K+ egocentric clips across 100+ cities and 6 continents address — the visual diversity of real manipulation environments that simulation cannot enumerate.

Key Takeaways

  • The sim-to-real gap is not a single phenomenon — it has four distinct causes (visual, physics, sensor noise, long-tail), each requiring a different mitigation.
  • Domain randomization effectively addresses the visual domain gap but cannot fix physics approximation errors for contact-rich manipulation or generate long-tail real-world scenarios.
  • For rigid-body pick-and-place tasks, sim + DR can achieve reliable real-world transfer with minimal real data. For dexterous manipulation of deformable objects, real interaction data is required.
  • Physics approximation errors are most severe for soft materials, compliant grippers, and contact-rich assembly — tasks that are increasingly important for humanoid robots.
  • The long-tail gap requires diverse real-world data at scale, not more simulation. This is the specific gap that large-scale egocentric video collections address.
  • Practical 2026 approach: simulation + DR for data volume, real robot demonstrations for contact physics accuracy, real environment imagery for visual grounding, and large-scale egocentric video for long-tail distributional coverage.

Frequently Asked Questions

What causes the sim-to-real gap?

The sim-to-real gap arises from four distinct sources of mismatch between simulation and the physical world: (1) Visual domain gap — simulated rendering differs from real camera images in lighting, texture, reflection, and depth cues; (2) Physics approximation error — simulators like MuJoCo and Isaac Gym use simplified contact models and friction approximations that diverge from real material interactions, especially for deformable objects and soft contacts; (3) Sensor noise mismatch — real sensors have specific noise profiles (depth sensor speckle, camera motion blur, proprioceptive drift) that are absent or poorly modeled in simulation; (4) Long-tail scenario absence — rare events, unexpected object configurations, and edge-case environments that appear in real deployment are not represented in simulation because they are difficult to enumerate and procedurally generate.

Can domain randomization eliminate the sim-to-real gap?

Domain randomization (DR) substantially reduces the visual domain gap by training policies on randomized textures, lighting, and camera parameters. OpenAI's Dactyl demonstrated that DR could transfer dexterous manipulation policies from MuJoCo to a physical Shadow Dexterous Hand. However, DR has three well-documented limitations: it cannot address physics approximation error for contact-rich tasks (because randomizing friction and mass parameters doesn't capture the systematic errors in contact models), it does not generate long-tail real-world scenarios that aren't in the randomization distribution, and aggressive DR can degrade policy performance on the nominal task by forcing the model to handle physically unrealistic variations. Real-world data remains necessary to supplement DR for production deployment.

What data do I need to close the sim-to-real gap?

Each cause of the sim-to-real gap requires different data to address: For the visual domain gap, real-world images or video from the deployment environment with diverse lighting, time-of-day variation, and realistic object appearances. For physics approximation error, real robot interaction data — teleoperation or demonstrations — that captures actual contact dynamics, including failures and recoveries that reveal the real physics. For sensor noise, data collected with the exact sensor hardware (depth camera model, proprioception sensors) that will be used in deployment, not a simulated approximation. For long-tail scenarios, diverse real-world footage from many environments and object configurations — this is where egocentric video at scale is useful, as it captures the visual diversity of real-world environments across many locations, times, and conditions.