Teleoperation Dataset Collection for Robot Learning

Teleoperation generates the highest-quality demonstration data for robot learning — a human operator directly controlling the robot produces state-action pairs that imitation learning algorithms consume without inference. But current teleoperation pipelines produce fewer than 200 demonstrations per day and require $50K-150K hardware rigs per station. Claru scales demonstration collection by deploying managed contributor networks across real-world environments, capturing the behavioral diversity that single-lab teleoperation cannot provide.

The Scale Problem: Labs Produce Fewer Than 200 Demos per Day

Teleoperation is the gold standard for robot demonstration data because it captures the exact state-action mapping that imitation learning requires — no inverse dynamics inference, no reward engineering, no sim-to-real transfer. Yet even the most advanced teleoperation systems generate less than 0.1% of the state-action space a general-purpose policy needs. A typical lab setup produces 50-200 demonstrations per day depending on task complexity and reset time [1]. The DROID dataset — one of the largest teleoperation efforts to date — collected 76,000 episodes across 564 scenes, but this required coordinated effort across 13 institutions over 18 months [4]. Scaling teleoperation within a single lab hits hard constraints: operator fatigue limits sessions to 2-4 hours, hardware maintenance creates downtime, and a single location provides no environmental diversity.
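
To make the direct state-action consumption concrete, here is a minimal behavior-cloning sketch, assuming demonstrations arrive as (state, action) tensors; the dimensions, network, and training loop are illustrative placeholders, not any lab's actual pipeline.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical teleoperation log: each timestep pairs a robot state
# with the action the human operator issued at that instant.
states = torch.randn(10_000, 14)   # e.g., 7 joint positions + 7 velocities
actions = torch.randn(10_000, 8)   # e.g., 7 joint targets + gripper command

# A small MLP policy trained by plain supervised regression: no reward
# engineering, no inverse dynamics inference, no sim-to-real transfer step.
policy = nn.Sequential(
    nn.Linear(14, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 8),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
loader = DataLoader(TensorDataset(states, actions), batch_size=256, shuffle=True)

for epoch in range(10):
    for s, a in loader:
        loss = nn.functional.mse_loss(policy(s), a)
        opt.zero_grad()
        loss.backward()
        opt.step()
```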


The Hardware Barrier: $50K-150K per Teleoperation Station

Traditional teleoperation requires specialized hardware that creates a steep cost curve. Leader-follower arm systems (GELLO, ALOHA), in which the operator puppeteers a leader arm that the robot mirrors, cost $50K-150K per station including the leader arm, follower robot, and instrumentation. VR-based approaches have reduced the interface hardware burden — Open-TeleVision uses a cross-platform web interface supporting Apple Vision Pro and Meta Quest for stereoscopic immersive teleoperation [3] — but still require the follower robot hardware at each collection site. The ACE system reduced the vision component to a single hand-facing camera for 3D hand pose estimation, enabling cross-platform teleoperation without per-robot calibration [2]. HumanPlus went further, using a single $50 RGB camera to track full-body human poses and retarget them to a humanoid robot, achieving 60-100% success rates across tasks after just 40 hours of training data [1]. Despite these advances, every approach still requires a physical robot at the collection site — and the robot is the bottleneck, not the interface.
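
As a rough sketch of the camera-based shadowing idea, the snippet below turns three tracked arm keypoints into a single robot elbow-joint target; it is a deliberately simplified stand-in for the full-body retargeting HumanPlus and ACE describe, and every keypoint value and joint limit in it is hypothetical.

```python
import numpy as np

def elbow_angle(shoulder, elbow, wrist):
    """Interior elbow angle (radians) from three 3D keypoints."""
    upper = shoulder - elbow
    fore = wrist - elbow
    cosang = np.dot(upper, fore) / (np.linalg.norm(upper) * np.linalg.norm(fore))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def retarget(human_angle, robot_limits=(0.1, 2.9)):
    """Map a human joint angle onto the robot's joint range.

    Real systems solve this per-limb; clipping to joint limits is
    the minimal safety step shown here.
    """
    return float(np.clip(human_angle, *robot_limits))

# One frame of (hypothetical) pose-estimator output, in meters.
kp = {
    "shoulder": np.array([0.0, 0.0, 1.4]),
    "elbow":    np.array([0.0, 0.3, 1.2]),
    "wrist":    np.array([0.2, 0.5, 1.1]),
}
target = retarget(elbow_angle(kp["shoulder"], kp["elbow"], kp["wrist"]))
print(f"robot elbow target: {target:.2f} rad")
```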


The Diversity Problem: Lab Environments Do Not Represent Deployment

Even with cheaper interfaces and faster collection rates, lab-based teleoperation produces data from a narrow environmental distribution. A teleoperation dataset collected in 3 university kitchens does not capture the lighting variation, clutter density, counter heights, or tool configurations of the thousands of real kitchens where a home robot would operate. The same policy that achieves 85% success in the training kitchen may drop to 40% in a novel kitchen with different cabinet layouts and unfamiliar utensils. This distribution gap is distinct from the sim-to-real gap — it exists entirely within the real world, between the data collection environment and the deployment environment. Addressing it requires collecting demonstrations across hundreds of distinct physical locations, which is operationally infeasible for any single lab [4][5].
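
One way to surface this gap before deployment is to hold out entire scenes, not random frames, when splitting a demonstration dataset; the sketch below uses a made-up episode index to show scene-level splitting, so evaluation environments never appear in training.

```python
import random

# Hypothetical episode index: each demo is tagged with the physical scene
# it was collected in. Random per-episode splits leak scenes into eval;
# scene-level splits measure the train-to-deployment environment gap.
episodes = [{"id": i, "scene": f"kitchen_{i % 12}"} for i in range(600)]

scenes = sorted({ep["scene"] for ep in episodes})
random.seed(0)
random.shuffle(scenes)
held_out = set(scenes[:3])  # entire kitchens withheld from training

train = [ep for ep in episodes if ep["scene"] not in held_out]
eval_ = [ep for ep in episodes if ep["scene"] in held_out]
print(len(train), "train episodes;", len(eval_), "episodes from unseen scenes")
```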


How Do Teleoperation Data Collection Approaches Compare?

Four distinct approaches to teleoperation data collection have emerged, each making different trade-offs between data quality, collection speed, hardware cost, and environmental diversity. The choice depends on whether the downstream task prioritizes precise force-torque trajectories or broad behavioral coverage across environments.

VR-Based Teleoperation (Open-TeleVision)

Scale: 50-200 demos/day per station
Tasks: Bimanual manipulation, object handoff
Environments: Single lab per VR station
Limitations: Cross-platform VR interface but requires a follower robot at each site; immersive but no haptic feedback; limited to the robot's workspace

Leader-Follower Arm Systems (ALOHA, GELLO)

Scale: 50-150 demos/day per station
Tasks: Dexterous manipulation, insertion
Environments: Single lab per hardware rig
Limitations: $50K-150K per station; operator fatigue limits sessions to 2-4 hours; no environment diversity

Camera-Based Shadowing (HumanPlus, ACE)

Scale: 100-300 demos/day per operator
Tasks: Full-body humanoid control, hand manipulation
Environments: Any environment with a camera setup
Limitations: Single $50 camera (HumanPlus) but still needs a robot for deployment; 40 hours of training data minimum; cross-embodiment transfer unproven at scale

Claru Managed Collection Network

Scale: 386K+ clips across prior engagements; ~500 global contributors
Tasks: Manipulation, locomotion, tool use, workplace tasks across 10+ categories
Environments: Real kitchens, workshops, barista stations, carpentry shops, retail; multi-country coverage
Limitations: Captures human demonstrations (not robot state-action pairs); requires post-processing for robot policy training

Egocentric Video Data Collection for Robotics and World Modeling

386K+ total first-person video clips captured
219K GoPro & DJI wearable capture clips
155K smartphone capture clips
~500 global contributors across 3 pipelines

We built a capture and ingestion platform from the ground up rather than adapting an off-the-shelf tool, and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types. The first pipeline deployed GoPro and DJI wearable cameras for high-fidelity, wide-angle egocentric capture of manipulation tasks, cooking, and locomotion, producing 219,000+ clips. The second pipeline used smartphone cameras for rapid, high-volume capture of everyday activities across diverse indoor and outdoor environments, producing 155,000+ clips.

Read Full Case Study

Workplace Egocentric Video Data for General-Purpose Robotics

10 distinct workplace categories captured on-site
4K/60fps capture resolution via standard smartphones
Multi-country geographic coverage across global locations
<48h contributor onboarding time per business

We embedded data capture directly into real-world business operations across multiple countries and 10 workplace categories. Business owners and workers were onboarded as contributors through a lightweight side-revenue model that kept participation voluntary and minimally disruptive to normal workflow. Workplace categories spanned food service (barista, cooking), skilled trades (carpentry, tailoring, screen printing), repair services (phone repair, tool repair), textile work (clothing shop, ironing), and assembly (furniture assembly, paper cutting).

Read Full Case Study

Frequently Asked Questions

What is the difference between teleoperation data and human demonstration data?

Teleoperation data captures state-action pairs where a human directly controls a robot, producing trajectories in the robot's own action space. Human demonstration data captures the same tasks performed by humans without robot mediation, typically as video. Teleoperation data can be consumed directly by imitation learning algorithms; human demonstration data requires cross-embodiment transfer or action extraction. Claru specializes in scaled human demonstration collection, which provides the behavioral diversity that complements smaller teleoperation datasets.

How many demonstrations does it take to train a robot policy?

Between 50 and 5,000 demonstrations per task, depending on the algorithm and task complexity. HumanPlus achieved 60-100% success with 40 hours of training data across diverse tasks. DROID collected 76,000 episodes for multi-task generalization. Claru's approach reduces the number of robot-in-the-loop demonstrations needed by providing broad human demonstration coverage — labs typically need 5-10x fewer robot demonstrations when pre-training on diverse human video data.
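
A minimal sketch of the pre-train-then-fine-tune recipe this answer alludes to, assuming a shared visual encoder; the toy reconstruction objective, tensor shapes, and two-stage split are placeholders rather than Claru's or any lab's actual training code.

```python
import torch
import torch.nn as nn

# Stage 1 (large, cheap data): pre-train a visual encoder on human video.
# Stage 2 (small, expensive data): fine-tune an action head on robot demos.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU())
recon = nn.Linear(512, 3 * 64 * 64)  # toy self-supervised reconstruction head
head = nn.Linear(512, 8)             # action head for behavior cloning

# --- Stage 1: placeholder self-supervised objective on human video ---
opt1 = torch.optim.Adam([*encoder.parameters(), *recon.parameters()], lr=1e-4)
human_frames = torch.randn(256, 3, 64, 64)        # stand-in video batch
loss = nn.functional.mse_loss(recon(encoder(human_frames)),
                              human_frames.flatten(1))
opt1.zero_grad(); loss.backward(); opt1.step()

# --- Stage 2: behavior cloning on far fewer robot demonstrations ---
opt2 = torch.optim.Adam(head.parameters(), lr=3e-4)  # encoder kept frozen
robot_frames = torch.randn(32, 3, 64, 64)            # teleop observations
robot_actions = torch.randn(32, 8)                   # operator actions
with torch.no_grad():
    feats = encoder(robot_frames)
loss = nn.functional.mse_loss(head(feats), robot_actions)
opt2.zero_grad(); loss.backward(); opt2.step()
```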

Can human demonstration data replace robot teleoperation entirely?

Not yet for precision manipulation tasks requiring force-torque feedback. HumanPlus and ACE demonstrate that camera-based systems can retarget human poses to robots with 60-100% success on gross motor tasks, but contact-rich manipulation (insertion, fastening, deformable object handling) still benefits from direct robot-in-the-loop teleoperation. Claru's data serves as a complement — providing the environmental and behavioral diversity at scale — while targeted teleoperation provides the embodiment-specific precision for the final 5-10% of task performance.

What hardware does Claru capture with, and how is the data delivered?

Claru captures with GoPro and DJI wearable cameras (219K clips at high-fidelity wide-angle) and smartphones (155K clips at 4K/60fps). Data is delivered as per-frame image sequences with structured metadata: timestamps, activity labels, environment descriptors, interaction complexity scores, and object counts. The format is compatible with standard robotics training frameworks including RT-X, Octo, and OpenVLA data loaders.
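
To illustrate what per-frame sequences plus structured metadata can look like in practice, here is a hypothetical clip record with a small loader stub; every field name and path below is invented for illustration and is not Claru's actual delivery schema.

```python
import json
from pathlib import Path

# Hypothetical per-clip metadata sidecar (field names illustrative only).
clip_meta = {
    "clip_id": "barista_000123",
    "fps": 60,
    "resolution": [3840, 2160],
    "activity_label": "espresso_pull",
    "environment": {"type": "barista_station", "lighting": "indoor_mixed"},
    "interaction_complexity": 0.72,
    "object_count": 9,
    "frames": ["frames/000001.jpg", "frames/000002.jpg"],
}

def load_clip(meta_path: Path):
    """Read one clip's metadata and return (frame paths, label) for training."""
    meta = json.loads(meta_path.read_text())
    frames = [meta_path.parent / f for f in meta["frames"]]
    return frames, meta["activity_label"]

# Usage: write the sidecar, then load it back as a training record.
Path("clip.json").write_text(json.dumps(clip_meta))
frames, label = load_clip(Path("clip.json"))
print(label, len(frames), "frames")
```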

Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.

Or email us directly at [email protected]


References

  [1] Fu et al. "HumanPlus: Humanoid Shadowing and Imitation from Humans." arXiv, 2024. A single $50 RGB camera tracks full-body human poses and retargets them to a humanoid robot, achieving 60-100% task success after 40 hours of training data.
  [2] Wang et al. "ACE: A Cross-Platform Visual-Exoskeletonless Teleoperation System." arXiv, 2024. A hand-facing camera estimates 3D hand poses for teleoperation without per-robot calibration, enabling cross-platform demonstration collection.
  [3] Cheng et al. "Open-TeleVision: Teleoperation with Immersive Active Visual Feedback." CoRL, 2024. Immersive VR teleoperation using stereoscopic video streaming and hand/wrist pose mirroring; the cross-platform interface supports Apple Vision Pro, Meta Quest, and browser-based access, validated on four long-horizon tasks with two humanoid robots.
  [4] Khazatsky et al. "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." arXiv, 2024. 76,000 demonstration episodes across 564 scenes and 86 tasks, collected across 13 institutions over 18 months, illustrating the multi-lab coordination required for large-scale teleoperation.
  [5] Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv, 2024. 1M+ episodes across 22 robot embodiments show that cross-embodiment data improves transfer, but collection remains bottlenecked by hardware availability and lab environment diversity.