Stereo Manipulation Dataset
Calibrated stereo camera recordings of manipulation tasks — pick-and-place, assembly, tool use — providing dense depth estimation ground truth for training depth-aware robot policies.
Why This Data Matters for Robotics
The Depth-Aware Manipulation domain is a critical frontier for robotic perception and autonomous systems. Real-world deployment demands training data captured in authentic environments, with the sensor modalities, environmental conditions, and task complexity that target applications actually encounter. Simulation and synthetic data provide useful pre-training signal, but the domain gap between synthetic and real-world stereo data remains a fundamental bottleneck for reliable deployment.
This dataset addresses this gap by providing purpose-collected stereo recordings from real-world environments with dense, human-verified annotations. Every clip captures genuine interactions and conditions — not staged demonstrations or simplified lab setups. The environmental diversity across collection sites helps trained models generalize to the range of conditions they will encounter in production.
For teams building Depth-Aware Manipulation systems, the annotation quality and density determine the ceiling of model performance. Claru's multi-layer annotation pipeline applies task-specific labels with human verification at every stage, producing training data where annotation accuracy matches the precision requirements of the downstream application.
Collection Methodology
Claru collectors deploy calibrated stereo sensor rigs in real-world environments following standardized collection protocols. Each session captures continuous recordings across varied conditions — different times of day, weather states, and activity levels — to ensure the dataset covers the full operational distribution of the target application.
Collection sites are selected for diversity across geographic regions, facility types, and environmental conditions. Each site contributes unique characteristics that broaden the training distribution and reduce overfitting to any single environment. Collectors follow facility-specific safety protocols and data handling procedures.
Raw sensor data is captured at full resolution with synchronized metadata including timestamps, sensor calibration parameters, and environmental condition logs. This metadata enables researchers to filter, subset, and augment the data for specific training objectives.
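As one way to use that metadata, filtering a clip index down to a target condition can be sketched as below. The field names (`clip_id`, `site_id`, `time_of_day`, `weather`) are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of filtering clips by synchronized metadata. All field names and
# values here are hypothetical examples, not the real dataset schema.

def filter_clips(clips, **conditions):
    """Return clips whose metadata matches every given key/value pair."""
    return [c for c in clips if all(c.get(k) == v for k, v in conditions.items())]

clips = [
    {"clip_id": "a01", "site_id": "site_03", "time_of_day": "night", "weather": "rain"},
    {"clip_id": "a02", "site_id": "site_03", "time_of_day": "day",   "weather": "clear"},
    {"clip_id": "b11", "site_id": "site_07", "time_of_day": "day",   "weather": "rain"},
]

rainy = filter_clips(clips, weather="rain")  # keeps a01 and b11
```

The same pattern extends to subsetting by site or time of day before packing a training split.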
Annotation Layers
Spatial Annotations
Bounding boxes, segmentation masks, or point labels for all objects and regions of interest in each frame or scan. Tracked across temporal sequences for object persistence.
Temporal Segments
Start/end timestamps for activities, events, and state changes. Enables training temporal reasoning models that understand process sequences and event causality.
Semantic Labels
Category and attribute labels for objects, surfaces, and environmental features. Provides the classification ground truth for perception model training.
Quality Indicators
Annotations marking data quality factors: occlusion level, motion blur, sensor artifacts. Enables quality-aware training that weights clean samples appropriately.
How Claru Compares
| Dimension | Academic Datasets | Claru |
|---|---|---|
| Environment diversity | 1-5 locations | 15+ sites across regions |
| Annotation density | 1-3 layers | 8+ layers, human-verified |
| Collection conditions | Controlled | Real-world operational |
| Format flexibility | Single format | Any format (RLDS, HDF5, custom) |
| Custom collection | Fixed dataset | On-demand expansion |
Use Cases and Model Training
Perception models for Depth-Aware Manipulation applications train on this dataset to build robust feature representations that handle the visual complexity and environmental variation of real-world deployment. The multi-layer annotations provide supervision signals for object detection, segmentation, tracking, and scene understanding tasks.
Policy learning systems that use visual observations as input benefit from the dataset's environmental diversity. Models trained on data from 15+ collection sites learn features that transfer across environments rather than memorizing site-specific visual patterns.
Evaluation and benchmarking teams use held-out subsets to measure model performance under realistic conditions. The environmental diversity and condition variation in the dataset enable rigorous evaluation of model robustness that controlled datasets cannot provide.
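One common way to exploit site diversity for robustness evaluation is a leave-one-site-out split, so that test clips never share an environment with training clips. A minimal sketch, assuming each clip record carries a `site_id` field (a hypothetical name):

```python
# Leave-one-site-out evaluation splits. Clip records are illustrative.
from collections import defaultdict

def leave_one_site_out(clips):
    """Yield (held_out_site, train_clips, test_clips) for each site."""
    by_site = defaultdict(list)
    for c in clips:
        by_site[c["site_id"]].append(c)
    for site, test in sorted(by_site.items()):
        train = [c for c in clips if c["site_id"] != site]
        yield site, train, test

clips = [{"clip_id": i, "site_id": s} for i, s in enumerate(["s1", "s1", "s2", "s3"])]
splits = list(leave_one_site_out(clips))  # one split per site
```

Averaging metrics across the held-out sites gives a generalization estimate that a single random split would hide.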
Frequently Asked Questions
What sensor configurations are used for collection?
Collection uses calibrated stereo sensors at full resolution with synchronized metadata. Specific sensor models and configurations vary by collection site and are documented in the dataset metadata. Custom sensor configurations can be accommodated for new collection campaigns.
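The calibration parameters are what turn stereo disparity into metric depth, via the standard pinhole relation Z = f · B / d (focal length in pixels, baseline in metres, disparity in pixels). A minimal sketch with illustrative numbers, not this dataset's actual calibration:

```python
# Depth from a calibrated stereo pair: Z = f * B / d.
# The focal length, baseline, and disparity below are made-up examples.

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulated depth in metres for one pixel; disparity must be > 0."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

z = depth_from_disparity(disparity_px=40.0, focal_px=800.0, baseline_m=0.12)
# 800 * 0.12 / 40 = 2.4 metres
```

This is why per-site calibration metadata matters: the same disparity map maps to different depths under different focal lengths and baselines.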
How many collection sites are included?
The dataset includes 15+ unique collection sites across multiple geographic regions, covering diverse environmental conditions, layouts, and operational contexts. Each site is documented with facility metadata and environmental condition logs.
Can data be delivered in custom formats?
Yes. Claru delivers data in any standard format including RLDS, HDF5, WebDataset, zarr, and custom formats. We handle all format conversion and packaging as part of the delivery pipeline.
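Of those formats, WebDataset is the simplest to illustrate: shards are plain tar files whose members share a key prefix (e.g. `000001.left.png`, `000001.meta.json`). A stdlib-only sketch of packing and reading one shard; the file names and payloads are illustrative, not the delivered layout:

```python
# WebDataset-style shard round trip using only the standard library.
# Member names and payload contents are hypothetical examples.
import io
import json
import tarfile

def pack_shard(path, samples):
    """Write (key, {extension: bytes}) samples into one tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Group shard members back into {key: {extension: bytes}}."""
    out = {}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            key, ext = member.name.split(".", 1)
            out.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return out

meta = json.dumps({"site_id": "site_03"}).encode()
pack_shard("shard-000000.tar", [("000001", {"meta.json": meta, "left.png": b"\x89PNG"})])
shard = read_shard("shard-000000.tar")
```

Sequential tar reads are why this layout streams well from object storage; HDF5 and zarr trade that for random access.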
Request a Sample Pack
Get a curated sample of this dataset with full annotations to evaluate for your project.