Depth Sensing Training Data: RGB-D and Depth Map Datasets for Robot Perception

Depth perception is the foundation of robot spatial reasoning: grasp planning, obstacle avoidance, and navigation all depend on accurate 3D scene understanding. Yet most depth datasets are captured in controlled indoor environments with commodity sensors, producing models that fail on the reflective surfaces, transparent objects, and outdoor lighting that real deployments demand.

Why Does Depth Perception Remain a Bottleneck for Robot Deployment?

Modern robot manipulation and navigation systems depend on depth sensing for spatial reasoning, yet depth estimation models trained on existing datasets exhibit systematic failure modes in real deployments. Depth Anything V2 achieved state-of-the-art monocular depth estimation by training on large-scale synthetic data combined with real-world images, demonstrating that scale and diversity in training data are the primary drivers of depth model quality. However, the authors noted that performance degrades significantly on scenes with transparent objects, specular surfaces, and thin structures that are underrepresented in training data. For robot manipulation, these are not edge cases: warehouse inventory includes transparent bottles, reflective packaging, and thin-walled containers. UniDepth introduced a framework for universal monocular metric depth estimation that generalizes across camera intrinsics, but its authors acknowledged that accuracy depends on the diversity of real-world scenes in the training set.

[1][2]
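The degradation on transparent and specular surfaces is easy to make visible with per-category evaluation. The sketch below uses absolute relative error (AbsRel), the standard monocular depth metric, scored separately on a transparent-object mask; the function name and toy numbers are illustrative, not from any of the cited papers.

```python
import numpy as np

def abs_rel_error(pred, gt, mask=None):
    """Mean absolute relative depth error, optionally restricted to a mask.

    pred, gt: depth maps in metres; mask: boolean array selecting pixels
    (e.g. a transparent-object mask) over which to evaluate.
    """
    valid = gt > 0  # ignore pixels with no ground-truth depth
    if mask is not None:
        valid &= mask
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Toy example: a model that is accurate overall but 30% off on a
# "transparent" region shows up clearly when evaluated per category.
gt = np.full((4, 4), 2.0)
pred = gt * 1.02                             # 2% error everywhere...
transparent = np.zeros((4, 4), dtype=bool)
transparent[:2, :2] = True
pred[transparent] = gt[transparent] * 1.3    # ...except on glass

overall = abs_rel_error(pred, gt)            # blended score hides the failure
on_glass = abs_rel_error(pred, gt, transparent)
```

Aggregate benchmark numbers blend both regimes, which is why a model can report low overall AbsRel while still being unusable for grasping glassware.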

What Gaps Exist in Current Depth Datasets for Robotics?

Existing depth datasets fall into two categories with distinct limitations. Synthetic datasets like Hypersim and Virtual KITTI provide perfect ground-truth depth but exhibit sim-to-real gaps in material properties, lighting, and sensor noise. Real-world datasets like ScanNet and NYU Depth V2 capture actual indoor scenes but are limited in scale, environment diversity, and sensor quality. ScanNet provides RGB-D scans of 1,513 indoor rooms, making it the largest real-world indoor depth dataset, but all scenes use a single sensor type in residential and office environments. GraspNet-1Billion pairs depth data with grasp annotations across 88 objects but in controlled tabletop setups with uniform lighting. Neither category provides the combination of real-world capture, diverse environments, and task-specific annotations that robot depth perception requires for reliable deployment.

[3][4]

How Do Sensor-Specific Artifacts Limit Depth Data Utility?

Different depth sensors produce different artifacts: structured-light sensors fail on reflective and dark surfaces, time-of-flight sensors suffer from multi-path interference, and stereo cameras lose accuracy at range and in textureless regions. Depth Anything V2 showed that training on diverse sensor data improves model robustness to sensor-specific noise, but existing datasets predominantly use a single sensor type. A depth model trained on Kinect data from NYU Depth V2 will exhibit systematic errors when deployed with a RealSense camera or stereo pair on a robot. For robot applications where depth accuracy directly determines grasp success or collision avoidance reliability, sensor-specific training data that matches the deployment sensor configuration is essential.

[1]
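When deployment-matched capture is scarce, one common mitigation is to augment clean depth with deployment-sensor artifacts. A minimal sketch, assuming a structured-light-style noise model in which dark or specular-deflected pixels drop out entirely and noise grows roughly quadratically with range; the function name, thresholds, and constants are illustrative assumptions, not measured sensor parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_structured_light(depth, reflectance, hole_thresh=0.15):
    """Apply structured-light-style artifacts to a clean depth map.

    Pixels with very low effective reflectance (dark surfaces, or specular
    surfaces deflecting the projected pattern) return no signal, so their
    depth reads as 0 (a hole); the rest get noise that grows with the
    square of the distance, as on commodity structured-light sensors.
    """
    noisy = depth + rng.normal(0.0, 0.002, depth.shape) * depth**2
    noisy[reflectance < hole_thresh] = 0.0  # dropout holes
    return noisy

depth = np.full((8, 8), 1.5)                       # clean depth, metres
reflectance = rng.uniform(0.0, 1.0, depth.shape)   # per-pixel surface response
noisy = simulate_structured_light(depth, reflectance)
holes = (noisy == 0.0)
```

Training against such augmented maps is a stopgap, not a substitute for sensor-matched capture: the hole pattern and noise profile of a real RealSense or Kinect are scene-dependent in ways a simple model does not reproduce.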

How Do Open Depth Datasets Compare to Custom Collection?

The table below compares major depth sensing datasets against Claru custom collection. Key gaps in open data include environment diversity, sensor variety, and pairing with task-specific annotations for robot applications.

NYU Depth V2

Scale: 1,449 labeled RGB-D frames, 464 scenes
Tasks: Indoor scene understanding, depth estimation
Environments: Indoor residential and office environments
Limitations: Small scale; single Kinect sensor; limited scene diversity; no robot-task annotations

ScanNet

Scale: 2.5M RGB-D frames, 1,513 rooms, 21K objects
Tasks: 3D scene understanding, semantic segmentation, depth completion
Environments: Indoor rooms; residential and office
Limitations: Single sensor type; indoor-only; no outdoor or industrial environments; no manipulation annotations

Depth Anything V2 Training Mix

Scale: 62M synthetic + real images
Tasks: Monocular depth estimation
Environments: Diverse web-scale scenes; synthetic and real
Limitations: No robot-specific annotations; no paired action data; depth as auxiliary task, not primary; no sensor-specific calibration

GraspNet-1Billion (Depth)

Scale: 256 RGB-D scenes, 88 objects
Tasks: 6-DoF grasp detection with depth data
Environments: Lab tabletop with controlled lighting
Limitations: 88 objects only; controlled conditions; no reflective, transparent, or deformable objects

Claru Custom

Scale: 386K+ video clips with depth enrichment, configurable sensor setups
Tasks: Configurable: depth-paired manipulation, navigation, grasp planning, obstacle detection in real deployment environments
Environments: Real warehouses, homes, workplaces, outdoor; diverse lighting and surface conditions
Limitations: Requires engagement lead time (days to launch, 1-2 week calibration); not a public benchmark

Frequently Asked Questions

What types of depth data does Claru provide?

Claru provides three categories of depth data: hardware-captured RGB-D using LiDAR-equipped devices (iPhone Pro, iPad Pro, RealSense), AI-enriched depth maps generated from monocular RGB video using state-of-the-art depth estimation models, and hybrid datasets combining both sources with uncertainty estimates. All depth data is paired with task-specific annotations for robot applications.
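The three categories could share a single record schema. A hypothetical sketch, assuming NumPy arrays per frame; the class and field names are illustrative and not Claru's actual delivery format.

```python
from dataclasses import dataclass
from enum import Enum
import numpy as np

class DepthSource(Enum):
    HARDWARE = "hardware"      # LiDAR / RealSense capture
    AI_ENRICHED = "ai"         # monocular depth model output
    HYBRID = "hybrid"          # fused hardware + AI, with uncertainty

@dataclass
class DepthSample:
    rgb: np.ndarray            # H x W x 3 uint8 image
    depth: np.ndarray          # H x W float32, metres (0 = no reading)
    confidence: np.ndarray     # H x W float32 in [0, 1]
    source: DepthSource
    annotations: dict          # task-specific labels (grasps, obstacles, ...)

# Example record for a fused frame with a confidence map attached.
sample = DepthSample(
    rgb=np.zeros((4, 4, 3), dtype=np.uint8),
    depth=np.ones((4, 4), dtype=np.float32),
    confidence=np.full((4, 4), 0.9, dtype=np.float32),
    source=DepthSource.HYBRID,
    annotations={"grasp_points": []},
)
```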

Can Claru match the depth sensor configuration of my robot platform?

Yes. Claru configures depth capture to match the sensor type, resolution, and field of view of the target robot platform. For deployments using RealSense, ZED, or custom stereo rigs, collection protocols are designed to produce depth data with sensor-matched noise characteristics and range profiles. Output formats match the point cloud or depth map representations your perception pipeline expects.
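As a concrete example of the depth-map-to-point-cloud conversion many perception pipelines expect, a depth map plus pinhole intrinsics back-projects to camera-frame 3D points. This is the standard pinhole computation; the function name and toy intrinsics are illustrative, not Claru's pipeline.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to an N x 3 point cloud (camera frame).

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    Pixels with depth 0 (no reading) are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# Toy 2x2 depth map, all points 2 m away.
depth = np.full((2, 2), 2.0)
pts = depth_to_points(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```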

How does Claru handle transparent objects and reflective surfaces?

Claru explicitly targets the depth failure modes that matter for robot deployment. Collection protocols include scenes with transparent containers, reflective packaging, specular metal surfaces, and thin structures. Multi-model depth estimation with uncertainty quantification identifies regions where depth is unreliable, enabling perception models to learn where to trust or distrust depth readings in production.
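Multi-model uncertainty quantification of this kind can be as simple as per-pixel disagreement across an ensemble of depth estimators. A minimal sketch, with an illustrative unreliability threshold; the specific ensembling Claru uses is not specified here.

```python
import numpy as np

def ensemble_uncertainty(preds):
    """Per-pixel depth estimate and uncertainty from multiple models.

    preds: list of H x W depth maps from different depth estimators.
    Returns (mean depth, per-pixel std). High std marks regions --
    e.g. glass or specular metal -- where the models disagree and the
    depth reading should not be trusted.
    """
    stack = np.stack(preds)
    return stack.mean(axis=0), stack.std(axis=0)

# Two models agree on the table but disagree on a transparent bottle.
a = np.full((3, 3), 1.0)
b = np.full((3, 3), 1.0)
b[0, 0] = 2.0  # model B sees through the glass to the wall behind it
mean_depth, unc = ensemble_uncertainty([a, b])
unreliable = unc > 0.1  # illustrative threshold
```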

How accurate are AI-enriched depth maps compared to hardware depth sensors?

State-of-the-art monocular depth estimation models like Depth Anything V2 achieve sub-5% relative error on standard benchmarks. For robot applications requiring metric accuracy (e.g., grasp planning), Claru combines AI depth with sparse hardware depth measurements for calibration and validates enriched depth maps against ground-truth sensor data where available. The enrichment pipeline includes per-pixel confidence scores that perception models can use to weight depth reliability.
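Per-pixel confidence scores plug naturally into training as a weighting term. A hedged sketch of a confidence-weighted L1 depth loss (illustrative, not Claru's actual pipeline): low-confidence labels contribute little, so a model is not penalized for disagreeing with unreliable enriched depth.

```python
import numpy as np

def weighted_l1_depth_loss(pred, target, confidence):
    """L1 depth loss weighted by per-pixel confidence in [0, 1].

    Confidence values are normalized into weights so the loss scale is
    independent of how many pixels were flagged unreliable.
    """
    w = confidence / (confidence.sum() + 1e-8)
    return float((w * np.abs(pred - target)).sum())

pred = np.array([[1.0, 1.0], [1.0, 1.0]])
target = np.array([[1.0, 1.0], [1.0, 3.0]])  # one bad depth label
conf = np.array([[1.0, 1.0], [1.0, 0.0]])    # bad label flagged unreliable
loss = weighted_l1_depth_loss(pred, target, conf)  # bad label ignored
```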

// INITIATE

Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.


Or email us directly at [email protected]


References

  1. [1] Yang et al. "Depth Anything V2." arXiv, 2024. State-of-the-art monocular depth estimation trained on large-scale synthetic and real data; demonstrated that scale and diversity drive depth model quality but noted degradation on transparent and specular surfaces.
  2. [2] Piccinelli et al. "UniDepth: Universal Monocular Metric Depth Estimation." CVPR 2024. Universal metric depth estimation that generalizes across camera intrinsics without fine-tuning; accuracy depends on training data scene diversity.
  3. [3] Dai et al. "ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes." CVPR 2017. 2.5M RGB-D frames across 1,513 indoor rooms, establishing the largest annotated real-world indoor depth benchmark; limited to a single sensor type and indoor environments.
  4. [4] Fang et al. "GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping." CVPR 2020. Paired depth data with 1 billion grasp annotations across 88 objects; demonstrated the value of depth-grasp pairing but limited to controlled tabletop conditions.