Custom Manipulation Trajectory Data Collection for Robotics

Open manipulation datasets cover broad task distributions but rarely match the specific embodiment, environment, and action-space representation your policy requires. Claru builds custom trajectory datasets from scratch — capturing the exact manipulation behaviors, sensor configurations, and annotation formats that production robotics systems need to generalize beyond the lab.

What Makes Manipulation Trajectory Data So Hard to Collect?

Manipulation trajectory data pairs observation streams (RGB, depth, proprioception) with timestamped action sequences (joint velocities, end-effector poses, gripper states) at control-loop frequency. Collecting this data at scale requires synchronized multi-modal capture, calibrated hardware, and structured annotation of task boundaries, contact events, and success criteria. AgiBot World demonstrated the infrastructure cost: 1 million trajectories across 217 tasks required a 4,000-square-meter facility, 100 robots, and a dedicated engineering team to maintain temporal alignment between camera feeds and joint-state logs [agibot-2025]. Most robotics labs lack this infrastructure entirely. The result is a field where even the largest open datasets span only 22 robot embodiments [oxe-2023], and labs training policies for new hardware or new tasks face a cold-start problem that no amount of pre-training on mismatched data solves.
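
To make the format concrete, one timestep of such a trajectory can be sketched as a small record pairing observations with an action. The field names, units, and the 15 Hz default below are illustrative assumptions for the sketch, not a schema from any of the datasets cited here.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    """One observation-action timestep (illustrative schema, not a real dataset's)."""
    t_us: int                     # microsecond timestamp on a shared monotonic clock
    rgb: bytes                    # encoded RGB frame
    depth: bytes                  # encoded depth frame
    joint_positions: list[float]  # proprioception: one value per joint
    gripper_open: bool            # binary gripper state
    action: list[float]           # e.g. commanded joint velocities or a pose target

def steps_per_episode(duration_s: float, control_hz: float = 15.0) -> int:
    """Trajectory length scales with control-loop frequency, not wall-clock time alone."""
    return int(duration_s * control_hz)
```

A 60-second episode at 15 Hz already yields 900 synchronized records, which is why maintaining temporal alignment between camera feeds and joint-state logs dominates the engineering cost at scale.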


Why Does Embodiment Mismatch Degrade Policy Transfer?

DROID collected 76,000 trajectories over 350 hours of interaction, but every trajectory used a single robot: the Franka Emika Panda [droid-2024]. Policies trained on DROID inherit Franka-specific kinematics, gripper geometry, and control-frequency assumptions that do not transfer to other arms without significant fine-tuning. Open X-Embodiment aggregated data from 22 different robots and showed that cross-embodiment transfer is possible in principle — but the dataset's quality variability across contributing labs meant that models trained on the full mixture often underperformed models trained on smaller, higher-quality subsets [oxe-2023]. AgiBot World's GO-1 model achieved a 30% improvement over models trained on Open X-Embodiment data, attributing the gap primarily to consistent capture quality across their controlled facility [agibot-2025]. The pattern is clear: trajectory data must match the target embodiment and maintain consistent quality to produce reliable policies.


How Do Task Coverage Gaps Limit Real-World Deployment?

Production manipulation systems encounter task distributions that open datasets were not designed to cover. A warehouse pick-and-place robot handles thousands of SKU geometries; a kitchen assistant robot navigates deformable objects, liquids, and articulated containers. DROID's 76,000 trajectories span tabletop manipulation with rigid objects — a narrow slice of real-world interaction [droid-2024]. AgiBot World covers 217 tasks but within a controlled facility that does not replicate the visual and physical variability of deployment environments [agibot-2025]. Generalist AI (GEN-0) claims 270,000 hours of robotic interaction data generated at 10,000 hours per week, but these figures are company-reported and not peer-reviewed, making independent verification impossible [gen0-2024]. Labs building production systems need trajectory data that matches their specific task distribution, not a generic benchmark.


How Do Open Manipulation Datasets Compare to Custom Collection?

The comparison below sets three widely cited manipulation trajectory datasets against Claru's custom collection approach. Scale alone does not determine utility; embodiment match, task coverage, and annotation consistency are the variables that predict policy performance.

AgiBot World

Scale: 1M+ trajectories, 217 tasks
Tasks: Tabletop manipulation, mobile manipulation, bimanual tasks
Environments: 4,000 sqm controlled facility, 100 robots
Limitations: Single facility limits environmental diversity; 5 embodiment types; not publicly available for all tasks

DROID

Scale: 76K trajectories, 350 hours
Tasks: Tabletop manipulation (rigid objects, limited deformable)
Environments: Multiple labs, but Franka Panda only
Limitations: Single embodiment (Franka); rigid-object bias; no mobile or bimanual tasks

Open X-Embodiment

Scale: 1M+ trajectories, 22 robots
Tasks: Broad but inconsistent; aggregated from 60+ contributing datasets
Environments: Heterogeneous lab settings across contributing institutions
Limitations: Quality variability across labs; inconsistent annotation formats; models trained on the full mixture often underperform curated subsets

Claru Custom Collection

Scale: 386K+ clips (egocentric) + 10,000+ hours (synchronized gameplay)
Tasks: Configured per engagement; task taxonomy co-designed with the research team
Environments: ~500 global contributors; real-world indoor and outdoor settings
Limitations: Requires a 1-2 week calibration phase per new engagement; not a public benchmark

Egocentric Video Data Collection for Robotics and World Modeling

- 386K+ total first-person video clips captured
- 219K GoPro & DJI wearable capture clips
- 155K smartphone capture clips
- ~500 global contributors across 3 pipelines

We built a purpose-built capture and ingestion platform — not adapted from an off-the-shelf tool — and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types. The first pipeline deployed GoPro and DJI wearable cameras for high-fidelity, wide-angle egocentric capture of manipulation tasks, cooking, and locomotion — producing 219,000+ clips. The second pipeline used smartphone cameras for rapid, high-volume capture of everyday activities across diverse indoor and outdoor environments — producing 155,000+ clips.

Read Full Case Study

Game-Based Data Capture for Real-World Simulation

- 10,000+ hours of synchronized gameplay data
- <16 ms video-to-input temporal alignment error
- Custom capture solution built from scratch
- 0 data loss incidents across all sessions

We designed and built a custom capture application from scratch. The system performs simultaneous screen recording at native resolution and raw input logging, capturing every keystroke, mouse movement, and controller input as structured data with microsecond-precision timestamps. Frame-level alignment between the video and control streams is maintained via a shared monotonic clock, with periodic sync markers to detect and correct any drift.
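
The shared-clock pairing described above can be illustrated with a short sketch. The helper below is a simplified assumption of how a <16 ms alignment budget might be verified, not Claru's actual implementation.

```python
import time

def now_us() -> int:
    """Microsecond timestamp from a monotonic clock shared by both streams."""
    return time.monotonic_ns() // 1_000

def max_pairing_error_us(frame_ts: list[int], input_ts: list[int]) -> int:
    """Worst-case gap between an input event and its nearest video frame.
    With 60 fps video (one frame every ~16,667 us), this stays under half
    a frame, comfortably inside a 16 ms alignment budget."""
    worst = 0
    for t in input_ts:
        nearest = min(frame_ts, key=lambda f: abs(f - t))
        worst = max(worst, abs(nearest - t))
    return worst

frames = [i * 16_667 for i in range(60)]   # one second of 60 fps frame timestamps
events = [5_000, 123_456, 900_000]         # three logged input events
assert max_pairing_error_us(frames, events) < 16_000
```

Because both streams read the same monotonic clock, wall-clock adjustments (NTP, DST) cannot introduce drift; the periodic sync markers mentioned above guard against hardware-level skew instead.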

Read Full Case Study
Same-day QA turnaround

Frequently Asked Questions

What action-space representations does Claru support?

Claru supports joint-velocity, end-effector pose (6-DOF position + orientation), and raw control input representations. The specific action space is configured per engagement based on the client's policy architecture. For imitation learning pipelines that consume observation-action pairs, Claru delivers per-frame action labels with microsecond-precision timestamps aligned to the video stream.
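
As an illustration of the end-effector pose representation described in this answer, a per-frame label might pack position, orientation, and gripper state into a single timestamped vector. The layout and names below are assumptions for the sketch, not Claru's delivery format.

```python
def pose_action_label(t_us: int,
                      xyz: tuple[float, float, float],
                      rpy: tuple[float, float, float],
                      gripper_open: bool) -> dict:
    """Pack a 6-DOF end-effector pose plus gripper state into one per-frame
    action label, timestamped for alignment with the video stream."""
    return {
        "t_us": t_us,                                   # microsecond timestamp
        "action": [*xyz, *rpy, 1.0 if gripper_open else 0.0],
    }

label = pose_action_label(1_000_000, (0.42, -0.10, 0.25), (0.0, 1.57, 0.0), True)
assert len(label["action"]) == 7
```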

How does the cost of custom collection compare to open datasets?

Open datasets are free to download but carry hidden costs: fine-tuning to compensate for embodiment mismatch, re-annotating inconsistent labels, and filtering quality-variable subsets. AgiBot World's facility required 100 robots and 4,000 square meters of dedicated space. Claru's distributed collection model avoids facility overhead entirely, and the 1-2 week calibration phase per engagement means production data collection begins within days, not months.

Can Claru capture data with our existing hardware?

Yes. Claru's capture pipelines are hardware-agnostic at the observation level — GoPro, DJI, smartphone, and custom camera rigs are all supported. For proprioceptive data (joint states, torques), Claru integrates with the client's teleoperation interface or deploys its synchronized capture system, which operates at the OS input layer rather than hooking into specific robot firmware.

What collection throughput can we expect?

Throughput depends on task complexity and annotation requirements. In the egocentric video engagement, Claru produced 386,000 clips across three parallel pipelines with approximately 500 global contributors. The game-based capture engagement produced 10,000 hours of synchronized data. Weekly delivery batches mean collection scales continuously rather than in discrete project phases.

How is data quality enforced during collection?

Every submission passes automated validation (resolution, duration, orientation, file integrity) at upload time, followed by human QA review within 24 hours. Inter-annotator agreement is tracked via real-time dashboards, and submissions falling below quality thresholds trigger specific remediation instructions to contributors. The structured activity taxonomy is enforced at the UI level, preventing free-text label drift across the contributor pool.
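
The upload-time checks described here can be sketched as a simple validator. The thresholds below (minimum resolution, clip-length bounds) are assumed for illustration and are not Claru's production rules.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    width: int
    height: int
    duration_s: float
    orientation: str   # "landscape" or "portrait"
    checksum_ok: bool  # file integrity verified against the uploaded hash

def validate(sub: Submission) -> list[str]:
    """Return failure reasons; an empty list means the clip passes intake."""
    errors = []
    if sub.width < 1280 or sub.height < 720:            # assumed minimum resolution
        errors.append("resolution below 1280x720")
    if not 2.0 <= sub.duration_s <= 600.0:              # assumed clip-length bounds
        errors.append("duration out of range")
    if sub.orientation not in ("landscape", "portrait"):
        errors.append("unknown orientation")
    if not sub.checksum_ok:
        errors.append("file integrity check failed")
    return errors
```

Failing clips would be routed back to the contributor with the specific reasons attached, matching the remediation loop described above.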


Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.


Or email us directly at [email protected]


References

[agibot-2025] AgiBot Team. "AgiBot World: A Unified Platform for Scalable and Diverse Robot Learning." arXiv, 2025. 1M+ trajectories across 217 tasks in a 4,000 sqm facility; the GO-1 model achieves a 30% improvement over Open X-Embodiment-trained baselines.
[droid-2024] Khazatsky et al. "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." arXiv, 2024. 76,000 trajectories over 350 hours of interaction data collected across multiple institutions, but limited to a single robot embodiment (Franka Emika Panda).
[oxe-2023] Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv, 2023. 1M+ trajectories from 22 robot embodiments across 60+ datasets; quality variability across contributing labs means models trained on the full mixture often underperform curated subsets.
[gen0-2024] Generalist AI. "GEN-0: Building a General-Purpose Robot." Company publication, 2024. Claims 270,000 hours of robotic interaction data generated at 10,000 hours per week; figures are company-reported and not peer-reviewed.