Closing the Sim-to-Real Gap with Real-World Data Collection
Simulation trains fast but deploys brittle. The gap between rendered physics and physical reality still causes 30-50% performance drops when policies transfer to hardware. Closing that gap requires structured real-world data collected at the exact distribution your simulator cannot reproduce. Claru operates the collection infrastructure that bridges simulation and deployment.
The Domain Gap Is a Data Problem, Not a Compute Problem
The sim-to-real gap refers to the performance degradation that occurs when a policy trained in simulation is deployed on physical hardware. Despite advances in photorealistic rendering and physics engines, simulated environments systematically differ from reality in ways that matter for control: contact dynamics, surface friction coefficients, lighting variation, sensor noise profiles, and object deformation under force. NVIDIA Isaac Sim achieves visual fidelity within 5% of real camera output, yet policies trained exclusively in Isaac Sim still exhibit 30-50% task success rate drops on physical robots due to dynamics mismatches that no renderer can solve [1]. The fundamental issue is distributional: simulation generates data from an approximation of the real world, and the approximation error compounds across long-horizon manipulation tasks where small force errors accumulate into large trajectory deviations.
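The compounding claim is easy to verify with a toy calculation (a sketch of the mechanism, not anything from the cited work): a constant per-step force error integrates twice, once into velocity and once into position, so the resulting trajectory deviation grows roughly quadratically with horizon length.

```python
# Toy model (our illustration, not from the cited work): a 1-D point mass
# integrated under a constant per-step force error. A small dynamics
# mismatch compounds because the force error integrates twice:
# once into velocity, then again into position.
def rollout(force_error: float, steps: int, dt: float = 0.01) -> float:
    """Return the accumulated position error after `steps` integration steps."""
    vel_err, pos_err = 0.0, 0.0
    for _ in range(steps):
        vel_err += force_error * dt  # force error -> velocity error
        pos_err += vel_err * dt      # velocity error -> position error
    return pos_err

short_horizon = rollout(force_error=0.5, steps=100)   # ~1 s of control
long_horizon = rollout(force_error=0.5, steps=1000)   # ~10 s of control
# The 10x longer horizon yields roughly a 100x larger position error.
```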
Why Domain Randomization Alone Falls Short
Domain randomization — varying textures, lighting, object masses, and friction parameters randomly during training — has been the standard mitigation since 2017. The approach works for perception-heavy tasks (object detection, pose estimation) but degrades for contact-rich manipulation where the randomization range must cover the true physical parameters without being so wide that the policy learns overly conservative behaviors. ABB and NVIDIA's HyperReality system demonstrated 99% correlation between simulated and real sensor readings with 0.5mm positioning accuracy, but achieved this only by constraining the simulation to a narrow, well-calibrated domain — industrial robotic cells with known geometry and materials [2]. Generalizing that calibration to diverse household or workplace environments remains unsolved. The Sim2Real-VLA architecture at ICLR 2026 showed that vision-language-action models trained exclusively on synthetic data can achieve zero-shot real-world transfer, but the paper's own ablation revealed that adding even 500 real-world demonstrations improved manipulation success rates by 18 percentage points over the synthetic-only baseline [3].
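In practice, domain randomization amounts to resampling physics and rendering parameters at the start of every training episode. A minimal sketch (the parameter names and ranges are our assumptions, not any particular simulator's API):

```python
import random

# Minimal sketch of domain randomization (parameter names and ranges are
# assumptions for illustration): resample physics and rendering parameters
# at the start of every training episode.
def sample_physics_params(rng: random.Random) -> dict:
    return {
        "friction": rng.uniform(0.4, 1.2),        # surface friction coefficient
        "object_mass_kg": rng.uniform(0.05, 2.0),
        "light_intensity": rng.uniform(0.3, 1.0),
        "sensor_noise_std": rng.uniform(0.0, 0.02),
    }

episode_params = sample_physics_params(random.Random(0))
```

The tension described above lives entirely in those bounds: too narrow and the true hardware falls outside the training distribution; too wide and the policy learns to hedge against physics it will never encounter.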
What Real-World Data Actually Fixes
Real-world data addresses three specific failure modes that simulation cannot resolve internally. First, contact dynamics: the force profiles generated when a gripper contacts a deformable object (fabric, food, paper) differ from simulated rigid-body or soft-body approximations in ways that depend on material batch, temperature, and humidity — variables that simulation randomizes uniformly but reality distributes non-uniformly. Second, perceptual distribution shift: real kitchens, workshops, and warehouses have lighting, clutter, and occlusion patterns that domain randomization under-represents because the randomization is typically parameterized by engineers who unconsciously bias toward well-lit, moderately cluttered scenes. Third, embodiment-specific dynamics: every physical robot has manufacturing tolerances, joint backlash, and cable routing that create systematic biases absent from its URDF model. Targeted real-world data collection addresses all three by sampling directly from the deployment distribution rather than approximating it.
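One of these embodiment-specific effects, joint backlash, is simple enough to model directly (an illustrative toy with assumed parameters; real backlash is messier): the joint output only moves once the commanded motion takes up a mechanical dead-band, so every direction reversal produces a flat spot that an idealized URDF model omits.

```python
# Toy backlash model (illustrative; the dead-band width is an assumption):
# the output angle trails the command inside a +/- deadband/2 zone, so each
# direction reversal produces a lag absent from the idealized model.
def joint_with_backlash(commanded: list[float], deadband: float) -> list[float]:
    out, pos = [], 0.0
    for c in commanded:
        if c > pos + deadband / 2:    # command presses the upper edge
            pos = c - deadband / 2
        elif c < pos - deadband / 2:  # command presses the lower edge
            pos = c + deadband / 2
        out.append(pos)               # otherwise: inside dead-band, no motion
    return out

# A ramp up then back down: the reversal after 0.2 rad shows up as a lag.
angles = joint_with_backlash([0.0, 0.1, 0.2, 0.1], deadband=0.05)
```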
How Do Current Approaches to Sim-to-Real Transfer Compare?
Four strategies dominate sim-to-real transfer today. Each involves a different data requirement and cost structure. Simulation-only training is cheapest per sample but most brittle at deployment. Full real-world collection is most robust but scales slowly. The hybrid approaches in between vary in how much real data they require and how effectively they use it.
- Simulation-Only (Isaac Sim / MuJoCo)
- Sim + Fine-Tuning (Sim2Real-VLA approach)
- Real-World Only (DROID / Open X-Embodiment)
- Claru Hybrid Collection
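The hybrid strategies in the middle of this spectrum share a common pattern: pre-train on cheap simulated batches, then fine-tune on a schedule that spreads a small number of real-world batches evenly among simulated ones. A minimal sketch (our construction, not a published recipe):

```python
# Sketch of the hybrid pattern (our construction, not a published recipe):
# during fine-tuning, spread a small number of real-world batches evenly
# among the cheaper simulated batches.
def make_fine_tune_schedule(n_updates: int, n_real: int) -> list[str]:
    """Return a per-update batch source, 'real' or 'sim'."""
    assert 0 <= n_real <= n_updates
    schedule = ["sim"] * n_updates
    if n_real > 0:
        stride = n_updates / n_real
        for k in range(n_real):
            schedule[int(k * stride)] = "real"  # evenly spaced real batches
    return schedule

# 10 fine-tuning updates, 3 of them driven by real demonstrations.
schedule = make_fine_tune_schedule(10, 3)
```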
Game-Based Data Capture for Real-World Simulation
We designed and built a custom capture application from scratch. The system performs simultaneous screen recording at native resolution and raw input logging, capturing every keystroke, mouse movement, and controller input as structured data with microsecond-precision timestamps. Frame-level alignment between the video and control streams is maintained via a shared monotonic clock, with periodic sync markers to detect and correct any drift.
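The alignment scheme described above can be sketched as follows (names and layout are illustrative; the production application is custom-built): input events and video frames are stamped from one monotonic clock, so the two streams can be aligned frame-by-frame after capture.

```python
import time
from dataclasses import dataclass

# Sketch of shared-clock timestamping (names are illustrative): every input
# event carries a microsecond reading of the same monotonic clock used to
# stamp video frames, so the streams can be aligned after capture.
@dataclass
class InputEvent:
    t_us: int      # microseconds on the shared monotonic clock
    device: str    # "keyboard", "mouse", or "controller"
    payload: dict  # e.g. {"key": "W", "down": True}

def now_us() -> int:
    """Shared monotonic clock in microseconds (immune to wall-clock jumps)."""
    return time.monotonic_ns() // 1_000

def frame_index_for(event_t_us: int, frame_t0_us: int, fps: float) -> int:
    """Map an input event onto the video frame active when it occurred."""
    return int((event_t_us - frame_t0_us) * fps / 1_000_000)

event = InputEvent(t_us=now_us(), device="keyboard", payload={"key": "W", "down": True})
```

Periodic sync markers then reduce to logging a known event into both streams at the same `now_us()` reading and checking that the recovered frame index matches.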
Egocentric Video Data Collection for Robotics and World Modeling
We built a purpose-built capture and ingestion platform — not adapted from an off-the-shelf tool — and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types. The first pipeline deployed GoPro and DJI wearable cameras for high-fidelity, wide-angle egocentric capture of manipulation tasks, cooking, and locomotion — producing 219,000+ clips. The second pipeline used smartphone cameras for rapid, high-volume capture of everyday activities across diverse indoor and outdoor environments — producing 155,000+ clips.
Frequently Asked Questions
How much real-world data does it take to close the sim-to-real gap?

Between 500 and 5,000 real-world demonstrations typically close the gap for a specific task domain. The Sim2Real-VLA study showed that 500 real demonstrations improved manipulation success by 18 percentage points over synthetic-only training. The exact volume depends on task complexity, environment diversity, and how well the simulator approximates the target domain. Claru sizes collection volume based on an initial calibration phase that measures transfer error reduction per added demonstration batch.
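That calibration phase can be sketched as a stopping rule over the measured success curve (our illustration; the internal procedure is not public): keep collecting batches until the marginal improvement flattens.

```python
# Sketch of a collection stopping rule (our illustration): measure real-world
# task success after each added batch of demonstrations and stop once the
# marginal gain drops below a threshold.
def batches_needed(success_by_batch: list[float], min_gain: float = 0.01) -> int:
    """Return the batch count at which marginal improvement falls below min_gain."""
    for i in range(1, len(success_by_batch)):
        if success_by_batch[i] - success_by_batch[i - 1] < min_gain:
            return i
    return len(success_by_batch)

# Hypothetical success rates measured after each 500-demo batch:
curve = [0.40, 0.58, 0.66, 0.70, 0.705, 0.707]
# batches_needed(curve) -> 4: gains flatten after the fourth batch.
```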
What kinds of real-world data deliver the most value for sim-to-real transfer?

Contact-rich manipulation data delivers the highest marginal value because contact dynamics are the hardest for simulators to model accurately. Specifically, data showing force profiles while grasping deformable objects, tool-surface interactions, and multi-step assembly sequences addresses the failure modes where simulation diverges most from reality. Claru prioritizes these interaction types through its activity-specific capture pipeline, which produced 12,000+ precisely labeled manipulation clips in a single engagement.
Can simulation-only training ever work without real-world data?

Yes, for narrow domains. The Sim2Real-VLA architecture demonstrated zero-shot sim-to-real transfer using vision-language-action models trained exclusively on synthetic data. ABB and NVIDIA's HyperReality system achieved 99% sim-real correlation in calibrated industrial cells. However, both approaches constrain the deployment environment — zero-shot transfer degrades rapidly as environment diversity increases. For general-purpose robotics targeting varied households or workplaces, real-world data remains necessary to cover the distribution that simulation cannot anticipate.
How do research teams integrate Claru's data into their training pipelines?

Claru delivers data in formats compatible with standard robotics training frameworks — per-frame image sequences paired with structured metadata (timestamps, activity labels, environment descriptors). Research teams typically use Claru's real-world data in three ways: as a fine-tuning set after simulation pre-training, as a validation set to measure sim-to-real transfer error, or as a calibration set to tune domain randomization parameters. The weekly delivery cadence means teams can begin fine-tuning runs during collection rather than waiting for a complete dataset.
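A loader for that kind of delivery might look like the following (the directory layout and field names here are assumptions for illustration, not a documented delivery schema):

```python
import json
from pathlib import Path

# Illustrative loader (directory layout and field names are assumptions, not
# a documented delivery schema): one episode folder holds per-frame images
# plus a metadata JSON carrying timestamps, an activity label, and an
# environment descriptor.
def load_episode(episode_dir: Path) -> dict:
    meta = json.loads((episode_dir / "metadata.json").read_text())
    frames = sorted(episode_dir.glob("frames/*.jpg"))
    # Invariant assumed here: one timestamp per frame.
    assert len(frames) == len(meta["timestamps"]), "frame/metadata mismatch"
    return {
        "frames": frames,                  # ordered per-frame image paths
        "timestamps": meta["timestamps"],  # capture times, one per frame
        "activity": meta["activity_label"],
        "environment": meta["environment"],
    }
```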
Your next hire isn't a vendor.
It's a data team.
Tell us what you're training. We'll scope the dataset.
References
- [1]NVIDIA Developer Blog. “Closing the Sim-to-Real Gap with NVIDIA Isaac Sim.” NVIDIA Developer, 2025. Isaac Sim achieves photorealistic rendering within 5% of real camera output, but policies still exhibit 30-50% success rate drops on physical hardware due to dynamics mismatches. Link
- [2]ABB & NVIDIA. “ABB Robotics Partners with NVIDIA to Deliver Industrial-Grade Physical AI at Scale (RobotStudio HyperReality).” ABB Technology Review, 2025. RobotStudio HyperReality achieves 99% sim-to-real correlation using NVIDIA Omniverse with ABB's virtual controller firmware; Absolute Accuracy technology reduces positioning errors from 8-15mm to 0.5mm. Available H2 2026. Link
- [3] Anonymous (under review). “Sim2Real-VLA: Bridging the Sim-to-Real Gap with Vision-Language-Action Models.” ICLR 2026. Vision-language-action models trained exclusively on synthetic data achieve zero-shot real-world transfer; adding 500 real demonstrations improves manipulation success by 18 percentage points. Link
- [4] Khazatsky et al. “DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.” arXiv, 2024. 76,000 demonstration episodes across 564 scenes and 86 tasks, showing that dataset scale and diversity improve cross-embodiment transfer. Link
- [5] Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” arXiv, 2024. 1M+ episodes across 22 robot embodiments demonstrate that cross-embodiment data improves transfer, but lab-environment bias limits real-world deployment performance. Link