Object Manipulation Training Data

Purpose-built datasets for training robot manipulation policies — from single-object pick-and-place to complex multi-step rearrangement tasks in cluttered environments.

Data Requirements

Modality

RGB + depth + proprioception + force/torque (optional) + language instructions

Volume Range

50-200 demonstrations per task (single-task policies) to 100K+ (foundation models)

Temporal Resolution

30 Hz video, 100-500 Hz proprioception

Key Annotations
Grasp type classification, contact event timestamps, object state transitions, task success/failure with failure taxonomy, 6-DoF end-effector pose, natural language task descriptions
Compatible Models
RT-2, Octo, Diffusion Policy, OpenVLA, ACT/ALOHA, RT-1, Pi-zero, HPT
Environment Types
Tabletop, Kitchen, Warehouse, Industrial assembly, Laboratory, Home environment

How Claru Supports This Task

Claru operates a distributed network of 10,000+ trained data collectors across 100+ cities, enabling rapid collection of diverse object manipulation demonstrations at scale. Our infrastructure supports multi-view synchronized recording at 30+ Hz with hardware-triggered camera alignment achieving sub-1 ms synchronization, delivering data in RLDS, HDF5, zarr, or WebDataset format. With 386,000+ annotated manipulation clips already in our catalog spanning egocentric video, tabletop manipulation, and multi-environment recordings, we can supplement custom collections with pre-existing data to accelerate your training pipeline. Each demonstration is scored for trajectory quality, temporal synchronization, and task completion, with automated flagging and 25% human spot-verification. We support single-task datasets (50-500 demonstrations, 1-2 week turnaround) through foundation model-scale datasets (100K+ demonstrations, multi-month campaigns with parallel collection sites).

Why Object Manipulation Data Matters

Object manipulation is the foundational capability for nearly every useful robot. Whether a robot is sorting packages in a warehouse, loading a dishwasher, or assembling electronics, it must grasp, transport, and place physical objects with precision and reliability. Training policies for these tasks requires large-scale, diverse demonstration data that captures the full distribution of object geometries, surface properties, lighting conditions, and scene configurations a robot will encounter in deployment.

The challenge is not merely volume but coverage. A manipulation policy trained on 10,000 demonstrations of picking up cubes will fail on cylinders. Research from Google DeepMind's RT-2 (Brohan et al., 2023) showed that scaling to 130,000 demonstrations across hundreds of object categories was necessary to achieve robust generalization. The Open X-Embodiment project (O'Neill et al., 2024) further demonstrated that cross-embodiment data from multiple robot platforms improves manipulation success rates by 50% compared to single-robot datasets. These findings establish that both object diversity and embodiment diversity are necessary for general-purpose manipulation.

Real-world deployment demands data collected in physical environments, not just simulation. While sim-to-real transfer has improved with domain randomization techniques (Tobin et al., 2017), policies trained exclusively on synthetic data still exhibit a 15-30% performance gap on novel objects compared to those trained on real-world demonstrations (James et al., 2019). This gap is especially pronounced for deformable objects, transparent surfaces, and cluttered scenes where physics engines produce unrealistic contact dynamics. The gap persists even with state-of-the-art rendering: transparent objects (bottles, glasses) cause depth sensor failures that simulators do not reproduce, and deformable objects (bags, cables, cloth) have contact physics too complex for real-time simulation.

The economics of manipulation data collection have shifted dramatically with the rise of foundation models. A team building a single-task policy (e.g., 'pick up a specific part from a tray') needs 50-500 demonstrations and can collect them in-house in a few days. A team fine-tuning a foundation model (OpenVLA, Octo, Pi-zero) for a new domain needs 5,000-50,000 demonstrations spanning diverse tasks and objects — a multi-week or multi-month collection effort requiring professional infrastructure. And a team building a proprietary foundation model from scratch needs 100,000-1,000,000+ demonstrations, which is infeasible without distributed collection across many sites and operators. At each scale, the data quality requirements are different: single-task data must be precise but narrow, while foundation model data must be diverse even at the cost of individual demonstration perfection.

Object Manipulation Data at Scale

50K-200K: Demonstrations for robust policies
30 Hz: Typical video capture rate
100+: Object categories needed
386K+: Claru annotated clips available
50%: Success improvement from cross-embodiment data
15-30%: Sim-to-real performance gap on novel objects

Core Data Modalities

RGB Video

Multi-view RGB streams from wrist-mounted and third-person cameras capture visual context, object appearance, and hand-object spatial relationships essential for visuomotor policies.

Depth & Point Clouds

Structured depth maps and 3D point clouds from stereo or LiDAR sensors provide geometric information critical for 6-DOF grasp planning and collision avoidance in cluttered scenes.

Proprioceptive State

Joint positions, velocities, torques, and end-effector poses sampled at 100-500 Hz give the policy access to the robot's internal state for compliant, adaptive manipulation.

Action Labels

End-effector delta poses, joint position targets, or discrete action tokens annotated at the control frequency define the supervision signal for imitation learning.

Language Instructions

Natural language task descriptions paired with demonstrations enable language-conditioned policies that generalize to novel instructions without retraining.

Data Collection Approaches

Teleoperation remains the gold standard for collecting manipulation demonstrations. Leader-follower systems like ALOHA (Zhao et al., 2023) enable bimanual data collection at 50 Hz with sub-millimeter positional accuracy. VR-based teleoperation with devices like Meta Quest 3 offers lower hardware costs but introduces latency and kinematic mismatch that can degrade demonstration quality. The choice of interface directly impacts the downstream policy — ACT (Zhao et al., 2023) achieved a 20% higher success rate when trained on leader-follower data compared to VR-collected demonstrations on the same tasks.

Kinesthetic teaching, where a human physically guides the robot, provides natural demonstrations but limits data throughput to roughly 5 demonstrations per hour. In contrast, experienced teleoperators can produce 20-40 demonstrations per hour on tabletop tasks. For large-scale data collection, parallelized teleoperation stations with multiple operators collecting simultaneously can reach throughputs of 200+ demonstrations per day. The DROID project demonstrated this at scale: 90 institutions collecting in parallel produced 76,000 demonstrations in months rather than the years a single lab would require.

Autonomous data collection through scripted policies or reinforcement learning-guided exploration has emerged as a complement to human demonstrations. Fleet-scale collection is feasible: Google's RT-1 (Brohan et al., 2022) used a fleet of 13 robots over 17 months to amass 130,000 demonstrations. However, autonomously collected data is biased toward already-learned behaviors, making human-collected edge cases essential for pushing policy boundaries. The practical approach is a hybrid: use autonomous collection for high-volume coverage of common manipulation scenarios, and human teleoperation for edge cases, novel objects, and failure recovery demonstrations.

Data quality varies significantly by collection method and has measurable impact on downstream policy performance. Mandlekar et al. (2021) showed that filtering the bottom 20% of demonstrations by trajectory smoothness and task completion time improved Diffusion Policy success rates by 15% compared to training on unfiltered data. This finding has led to the practice of per-demonstration quality scoring: each demonstration is assigned scores for trajectory smoothness (jerk norm), task efficiency (completion time relative to expert baseline), grasp quality (stability after lift), and overall success. Low-scoring demonstrations are either excluded or downweighted during training.
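The smoothness component of this scoring can be sketched as follows. This is an illustrative implementation, not a published pipeline: `jerk_norm` and `filter_bottom_fraction` are hypothetical helper names, and a production scorer would combine jerk with the completion-time, grasp-quality, and success scores described above.

```python
import numpy as np

def jerk_norm(positions: np.ndarray, dt: float) -> float:
    """Mean L2 norm of the discrete third derivative (jerk) of an
    end-effector trajectory. positions has shape (T, 3) and is sampled
    every dt seconds; lower values mean smoother demonstrations."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=1).mean())

def filter_bottom_fraction(demos, dt, frac=0.2):
    """Drop the least-smooth `frac` of demonstrations (highest jerk norm),
    mirroring the bottom-20% filtering reported by Mandlekar et al. (2021)."""
    scores = [jerk_norm(d, dt) for d in demos]
    cutoff = np.quantile(scores, 1.0 - frac)
    return [d for d, s in zip(demos, scores) if s <= cutoff]
```

A straight-line trajectory scores near zero jerk, while the same trajectory with sensor noise scores orders of magnitude higher, which is what makes this a usable ranking signal even before task-level labels are available.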

Object Manipulation Data Requirements by Model Architecture

Different model architectures have distinct data volume, observation, and format requirements. This table helps you plan collection based on your target architecture.

Model            | Min. Demos       | Observation Space         | Action Space         | Key Format
RT-2             | 100K+            | RGB 320x320               | 7-DoF EE delta       | RLDS/TFRecord
Octo             | 25K+             | RGB 256x256 + wrist cam   | 7-DoF EE delta       | RLDS
Diffusion Policy | 100-200 per task | RGB 96x96 multi-view      | Joint positions      | HDF5/zarr
OpenVLA          | 970K (pretrain)  | RGB 224x224               | 7-DoF EE delta       | RLDS
ACT (ALOHA)      | 50 per task      | RGB multi-view            | Joint positions      | HDF5
Pi-zero          | 10K+ (fine-tune) | RGB multi-view + language | Flow-matched actions | Custom/RLDS
HPT              | 200K+ (pretrain) | RGB + proprioception      | Heterogeneous        | RLDS

Quality Requirements and Annotation Standards

High-quality manipulation data requires precise temporal synchronization across all sensor streams. Camera timestamps must be aligned within 5 ms of proprioceptive readings to prevent the policy from learning incorrect visual-action correspondences. At typical manipulation velocities (0.3-1.0 m/s), a 10 ms desynchronization means the visual observation and the corresponding action are spatially offset by 3-10 mm — enough to degrade fine manipulation performance. Claru's data collection infrastructure uses hardware-triggered synchronization to guarantee sub-millisecond alignment across up to 8 simultaneous camera streams, with verification at the start of each collection session.
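A minimal alignment check consistent with the 5 ms budget above might look like the sketch below. The function names are illustrative (not Claru tooling); it assumes each stream delivers sorted timestamps in seconds and flags a session when any camera frame lacks a proprioceptive sample within tolerance.

```python
import numpy as np

SYNC_TOLERANCE_S = 0.005  # 5 ms alignment budget between camera and proprio

def max_sync_error(cam_ts: np.ndarray, proprio_ts: np.ndarray) -> float:
    """Worst-case gap between each camera frame and its nearest
    proprioceptive sample. Both inputs are sorted 1-D timestamp arrays."""
    idx = np.searchsorted(proprio_ts, cam_ts)
    idx = np.clip(idx, 1, len(proprio_ts) - 1)
    # Nearest neighbor is either the sample at idx or the one before it.
    nearest = np.minimum(np.abs(cam_ts - proprio_ts[idx]),
                         np.abs(cam_ts - proprio_ts[idx - 1]))
    return float(nearest.max())

def is_synchronized(cam_ts, proprio_ts, tol=SYNC_TOLERANCE_S) -> bool:
    return max_sync_error(np.asarray(cam_ts), np.asarray(proprio_ts)) <= tol
```

A 30 Hz camera against a 200 Hz proprioceptive stream passes this check (worst gap 2.5 ms), while a 50 Hz proprioceptive stream can leave gaps near 10 ms and fail it, which is one reason the 100-500 Hz proprioception rates cited earlier matter.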

Annotation standards for manipulation data extend beyond simple task success labels. Rich annotations include grasp type classification (power, pinch, lateral, precision), contact event timestamps (first-contact, stable-grasp, lift-off, pre-place, release), object state changes (picked, transported, placed, stacked, inserted), and failure mode categorization (slip, collision, timeout, wrong object, partial completion). These structured annotations enable filtering and curriculum learning strategies that can improve training efficiency by 30-40% compared to naive random sampling (Mandlekar et al., 2021). For language-conditioned models, each demonstration additionally requires a natural language task description that is both accurate and varied — templated descriptions like 'pick up the red cube' produce worse language grounding than diverse phrasings like 'grab that red block,' 'get the crimson cube,' 'pick up the small red box.'
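The annotation taxonomy above maps naturally onto a typed schema. The sketch below is a hypothetical representation (class and field names are ours, not a standard), showing how the grasp types, ordered contact events, and failure modes can be validated per demonstration before training.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

class GraspType(Enum):
    POWER = "power"
    PINCH = "pinch"
    LATERAL = "lateral"
    PRECISION = "precision"

class FailureMode(Enum):
    SLIP = "slip"
    COLLISION = "collision"
    TIMEOUT = "timeout"
    WRONG_OBJECT = "wrong_object"
    PARTIAL_COMPLETION = "partial_completion"

# Causal ordering of the contact events listed above.
CONTACT_EVENTS = ("first_contact", "stable_grasp", "lift_off", "pre_place", "release")

@dataclass
class DemoAnnotation:
    grasp_type: GraspType
    language_description: str
    success: bool
    contact_events: Dict[str, float] = field(default_factory=dict)  # event -> time (s)
    failure_mode: Optional[FailureMode] = None

    def validate(self) -> None:
        # Contact events that are present must appear in causal order.
        ts = [self.contact_events[e] for e in CONTACT_EVENTS if e in self.contact_events]
        if ts != sorted(ts):
            raise ValueError("contact events out of causal order")
        if not self.success and self.failure_mode is None:
            raise ValueError("failed demonstrations need a failure mode label")
```

Enforcing the event ordering at ingest time catches annotation mistakes (e.g. a lift-off timestamped before stable grasp) before they silently corrupt curriculum or filtering logic downstream.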

Data diversity is quantified along multiple axes: object geometry (convex, concave, articulated), material properties (rigid, deformable, granular), scene complexity (isolated, cluttered, occluded), and task variation (pick-place, stack, insert, pour, open, close). A production-grade manipulation dataset should cover at least 100 distinct object instances across 10+ material categories, collected in 5+ distinct environment configurations. The object set must include challenging categories that trip up vision systems: transparent objects (glass, clear plastic), reflective objects (metal, mirrors), thin objects (cards, papers), and small objects (screws, pills). Claru datasets routinely exceed these thresholds with data from over 100 cities worldwide.
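A coverage check against those thresholds is straightforward to automate. This is a minimal sketch assuming per-demonstration metadata dicts with `object_id`, `material`, and `environment` keys; the schema and function name are illustrative.

```python
def coverage_report(demo_metadata):
    """Check dataset diversity against the production-grade thresholds
    above: 100+ object instances, 10+ materials, 5+ environments.
    Returns {axis: (unique_count, meets_threshold)}."""
    demo_metadata = list(demo_metadata)
    thresholds = {"object_id": 100, "material": 10, "environment": 5}
    report = {}
    for key, minimum in thresholds.items():
        unique = {m[key] for m in demo_metadata}
        report[key] = (len(unique), len(unique) >= minimum)
    return report
```

Running this per collection session, rather than once at delivery, lets operators steer sampling toward under-covered axes while the campaign is still in progress.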

Object diversity follows a power law in real-world deployment: 80% of manipulation acts involve 20% of object categories (common items like boxes, bottles, cups), while the remaining 80% of categories appear rarely but critically. Training data must cover the long tail to prevent catastrophic failures on uncommon objects. Claru's object sampling protocol allocates 60% of collection episodes to common objects (ensuring deep coverage) and 40% to long-tail objects (ensuring breadth), with the long-tail set rotated across collection sessions to maximize unique object exposure without sacrificing per-object repetition count.
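The 60/40 allocation with long-tail rotation can be sketched as a simple episode planner. This is a hypothetical illustration of the protocol described above, not Claru's actual scheduler.

```python
import random

def plan_episodes(common, long_tail, n_episodes, common_frac=0.6, seed=0):
    """Allocate collection episodes between common and long-tail objects.

    Common objects are sampled with replacement (deep per-object coverage);
    long-tail objects are rotated round-robin so each object gets equal,
    repeated exposure across the session.
    """
    rng = random.Random(seed)
    n_common = round(n_episodes * common_frac)
    plan = [rng.choice(common) for _ in range(n_common)]
    plan += [long_tail[i % len(long_tail)] for i in range(n_episodes - n_common)]
    rng.shuffle(plan)
    return plan
```

Round-robin rotation, rather than uniform random sampling, is what guarantees the per-object repetition floor: with 400 long-tail episodes over 10 objects, every object gets exactly 40 demonstrations instead of a noisy spread.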

Key Datasets for General Object Manipulation

Public manipulation datasets vary in scale, diversity, and modality coverage. Understanding these helps identify gaps your custom dataset needs to fill.

Dataset            | Year    | Scale              | Objects                          | Modalities                      | Availability
RT-1/RT-2 (Google) | 2022-23 | 130K episodes      | Hundreds of kitchen items        | RGB + language                  | Proprietary
DROID              | 2024    | 76K demos          | Diverse (564 tasks)              | RGB-D + wrist + proprioception  | Public
Open X-Embodiment  | 2024    | 1M+ episodes       | Extremely diverse                | Mixed (varies by source)        | Public
Bridge V2          | 2024    | 60K demos          | Kitchen + tabletop               | RGB + wrist + language          | Public
RoboSet            | 2023    | 100K+ trajectories | Kitchen items                    | RGB multi-view + proprioception | Public
Claru Custom       | 2026    | 10K-500K+          | Custom to spec (100+ categories) | Full multi-modal                | Built to spec

Claru Data Delivery Pipeline

1. Requirements Scoping

Define target tasks, object categories, environment specifications, and data format requirements with your research team.

2. Collection Protocol Design

Design operator instructions, quality thresholds, and diversity sampling plans tailored to your manipulation task distribution.

3. Parallel Data Collection

Deploy trained operators across multiple collection sites with standardized hardware and real-time quality monitoring.

4. Annotation & Enrichment

Apply task-specific annotations: grasp types, contact events, object state labels, language descriptions, and success/failure classification with failure taxonomy.

5. Quality Assurance

Automated quality scoring (trajectory smoothness, sync verification, blur detection) plus 25% human spot-verification with inter-annotator agreement tracking.

6. Format & Deliver

Convert to your target format (RLDS, HDF5, zarr, WebDataset) with full metadata, camera calibrations, and stratified train/val/test splits.

References

1. Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818, 2023.
2. O'Neill et al. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024.
3. Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS 2023.
4. Brohan et al. "RT-1: Robotics Transformer for Real-World Control at Scale." RSS 2023 (arXiv 2022).
5. Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS 2023.
6. Khazatsky et al. "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." RSS 2024.
7. Mandlekar et al. "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation." CoRL 2021.

Frequently Asked Questions

How many demonstrations do I need to train a manipulation policy?

The number depends on your model architecture, task complexity, and generalization requirements. Single-task policies using Diffusion Policy or ACT can achieve 80%+ success with 50-200 demonstrations per task, making them ideal for proof of concept and single-application deployments. Multi-task foundation models like Octo require 25,000+ demonstrations across diverse tasks and objects to generalize, while OpenVLA was pretrained on 970,000 demonstrations. Start with 100-200 demonstrations for a single task to validate your pipeline and prove feasibility, then scale based on evaluation metrics. The key insight from recent research is that demonstration diversity (number of unique objects, scenes, and task variations) matters more than repetition count: 5,000 diverse demonstrations outperform 20,000 demonstrations of the same 10 objects on held-out evaluation.

How many camera views do I need, and what are the hardware requirements?

A minimum of two views is recommended: one wrist-mounted camera for close-up hand-object interaction and one third-person overhead or angled camera for scene context. Research shows that dual-view setups improve policy performance by 15-25% over single-view (Zhao et al., 2023). For complex multi-object tasks, adding a second third-person view from a different angle further improves spatial reasoning. The wrist camera should be wide-angle (120+ degree FOV) and low-latency (< 50 ms), with the Intel RealSense D405 being the current standard for wrist-mount RGB-D. Third-person cameras should cover the full workspace at 640x480 minimum resolution. All cameras must be hardware-synchronized to within 5 ms. For foundation model training, include both raw images and calibration data so downstream users can compute 3D correspondences between views.

Is simulation data a substitute for real-world demonstrations?

Both have a place, but real-world data is essential for production deployment. Sim-to-real transfer still shows a 15-30% performance gap on novel objects, with the gap widest for transparent objects (depth sensor failures not modeled in simulation), deformable objects (contact physics too complex for real-time simulation), and cluttered scenes (collision detection artifacts). The recommended approach is to pretrain on simulation data for basic motor primitives and spatial reasoning, then fine-tune on 5,000-20,000 real-world demonstrations covering the deployment object distribution. For foundation model fine-tuning, real data is non-negotiable: the model needs to learn the visual and physical characteristics of real objects, not rendered approximations. Claru specializes in real-world data collection at scale, with 10,000+ trained collectors across 100+ cities.

Which data format should I choose?

RLDS (Reinforcement Learning Datasets) is becoming the industry standard for cross-platform compatibility, used by Octo, OpenVLA, RT-X, and the Open X-Embodiment project. It stores episodes as TFRecord files with a standardized schema for observations, actions, rewards, and metadata, and supports efficient streaming for large-scale training. HDF5 is common for single-lab use with Diffusion Policy and ACT, offering simple random access to individual episodes. Zarr offers cloud-native streaming with chunked storage, making it ideal for datasets stored in S3 or GCS. WebDataset provides tar-based sequential access optimized for distributed training. Claru delivers in all four formats with full metadata, sensor calibration data, and provenance tracking. If unsure, start with RLDS for maximum compatibility with the current ecosystem.
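The episode structure RLDS standardizes can be sketched in plain Python. The field names below follow common RLDS conventions (per-step observation/action dicts with is_first/is_last/is_terminal flags) but are illustrative: the exact TFDS feature spec depends on the target model's dataset builder.

```python
import numpy as np

def make_episode(images, proprio, actions, language):
    """Assemble one demonstration into an RLDS-style nested structure
    of per-step dicts plus episode-level metadata."""
    n = len(actions)
    assert len(images) == len(proprio) == n, "streams must be time-aligned"
    steps = []
    for t in range(n):
        steps.append({
            "observation": {
                "image": images[t],
                "proprio": proprio[t],
                "language_instruction": language,
            },
            "action": actions[t],
            "is_first": t == 0,
            "is_last": t == n - 1,
            "is_terminal": t == n - 1,  # demos end by completion, not truncation
        })
    return {"steps": steps, "episode_metadata": {"num_steps": n}}
```

Serializing this structure to TFRecord (for RLDS), an HDF5 group per episode, or a zarr chunk per stream is then a mechanical conversion, which is why converging on one in-memory schema early simplifies multi-format delivery.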

How do you cover the long tail of object categories?

Real-world manipulation follows a power law: 20% of object categories account for 80% of manipulation acts, but the remaining 80% of categories include critical items that must not fail (medication bottles, sharp tools, fragile electronics). We address this through structured sampling: 60% of collection episodes cover common objects with deep repetition, and 40% cover long-tail objects rotated across sessions for maximum breadth. Within the long tail, we prioritize categories that challenge vision systems: transparent objects (require depth backup), reflective objects (cause specular highlights), thin/flat objects (hard to grasp from surfaces), deformable objects (unpredictable contact), and small objects (below gripper finger width). Each long-tail category gets a minimum of 50 demonstrations to ensure meaningful representation in the training distribution.

How many demonstrations can be collected per day, and how long will my dataset take?

Throughput depends on task complexity and teleoperation interface. For simple pick-and-place tasks (single object, clear workspace), an experienced teleoperator produces 30-40 demonstrations per hour using a leader-follower setup like ALOHA or a SpaceMouse interface. For complex multi-step tasks (stack 3 objects, open a drawer and retrieve an item), throughput drops to 10-20 demonstrations per hour due to longer episode duration and higher failure rates. Parallelizing across multiple stations with dedicated operators, Claru achieves 200-500 demonstrations per day for simple tasks and 100-200 per day for complex tasks. A 10,000-demonstration dataset for foundation model fine-tuning typically requires 4-8 weeks of active collection with 2-4 parallel stations. We track throughput, success rate, and quality scores in real time to optimize collection scheduling.
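The timeline arithmetic above can be packaged as a rough planner. This is a back-of-envelope sketch: the 4 productive teleoperation hours per station per day is an assumed planning figure, not a number from the text, and real campaigns lose additional time to setup, scene resets, and QA.

```python
def collection_weeks(total_demos, demos_per_hour, stations,
                     hours_per_day=4.0, days_per_week=5):
    """Estimated calendar weeks of active collection for a campaign,
    assuming `hours_per_day` of productive teleoperation per station."""
    per_week = demos_per_hour * hours_per_day * days_per_week * stations
    return total_demos / per_week
```

Under these assumptions, 10,000 demonstrations at 30 per hour take roughly 8 weeks with 2 stations and roughly 4 weeks with 4 stations, consistent with the 4-8 week range quoted above.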

Get a Custom Quote for Object Manipulation Data

Tell us about your target tasks, robot platform, and data volume needs. We will scope a collection plan and deliver production-ready datasets.