Monocular Depth Estimation: Predicting 3D Depth from a Single Camera

Monocular depth estimation (MDE) predicts a per-pixel depth map from a single RGB image, recovering 3D scene geometry without stereo cameras or LiDAR. Foundation models like Depth Anything and MiDaS have made MDE a practical perception module for robots that need spatial understanding from commodity cameras.

What Is Monocular Depth Estimation?

Monocular depth estimation is the task of predicting the distance from the camera to every visible surface in a scene, using only a single RGB image as input. The output is a dense depth map — a 2D array where each pixel value represents the estimated depth (distance from the camera plane) at that spatial location. This is an inherently ill-posed problem because infinitely many 3D scenes can project to the same 2D image, requiring the model to leverage learned priors about scene geometry, object sizes, texture gradients, and perspective cues.

MDE models are categorized by their output type. Relative depth models predict ordinal depth ordering without metric scale — they can determine that surface A is closer than surface B, but not by how many centimeters. Metric depth models predict absolute distances in real-world units, typically meters. A third hybrid category produces scale-invariant depth that can be converted to metric depth using a single known reference (such as the camera's focal length or a known object dimension).

Architecturally, modern MDE models use encoder-decoder designs with vision transformer (ViT) backbones. The encoder processes the input image through a pretrained ViT to extract multi-scale features that encode both local texture and global scene structure. The decoder upsamples these features back to the input resolution, predicting depth at every pixel. DPT (Ranftl et al., 2021) established this ViT-based architecture, and subsequent models like MiDaS v3.1, ZoeDepth, Depth Anything, and Metric3D refine the encoder-decoder design with better training data, loss functions, and domain adaptation strategies.

For robotics, MDE converts any RGB camera into a pseudo-depth sensor. This is valuable because depth sensors (LiDAR, structured light, time-of-flight) are expensive, have limited range or resolution, fail on certain surfaces (transparent, highly reflective), and add size and weight to the robot platform. MDE from a standard camera provides dense depth at any resolution, works on all surfaces, and adds zero hardware cost. The tradeoff is reduced accuracy compared to active depth sensors, particularly for metric depth where absolute distance matters for manipulation and collision avoidance.

Historical Context

Early approaches to monocular depth estimation used hand-crafted features and probabilistic models. Saxena et al. (2006) trained a Markov Random Field on image patches to predict depth, achieving the first learning-based MDE results. Eigen et al. (2014) introduced the first CNN-based MDE model with a multi-scale architecture, establishing the encoder-decoder paradigm that persists today.

A major advance came from self-supervised learning. Godard et al. (2017) showed that MDE could be trained without ground truth depth by using stereo image pairs and photometric consistency as the supervision signal — predicting depth that, when used to warp the left image to the right viewpoint, reconstructs the right image. Monodepth2 (Godard et al., 2019) refined this approach with per-pixel minimum reprojection loss and auto-masking of static pixels, making self-supervised MDE nearly competitive with supervised methods.

The foundation model era began with MiDaS (Ranftl et al., 2020), which trained on a mix of diverse depth datasets with a scale-and-shift-invariant loss function, achieving unprecedented zero-shot generalization across domains. DPT (Ranftl et al., 2021) upgraded the backbone to vision transformers. ZoeDepth (Bhat et al., 2023) introduced domain-specific metric depth heads on top of the MiDaS backbone, bridging relative and metric depth. Depth Anything (Yang et al., 2024) scaled to 62 million training images with semi-supervised learning, producing the most generalizable MDE model to date. Depth Anything V2 further improved quality and added direct metric depth prediction, making production-quality MDE accessible to any team with a standard camera.

Practical Implications

Deploying MDE in a robotics system requires decisions about model selection, metric calibration, and integration with downstream perception modules.

Model selection depends on the accuracy-latency tradeoff. Depth Anything V2 with a ViT-Large backbone achieves state-of-the-art accuracy but runs at 15-30 FPS on an NVIDIA Jetson AGX Orin. The ViT-Small variant runs at 60+ FPS with slightly reduced accuracy, suitable for real-time navigation. For offline processing (dataset enrichment, 3D reconstruction), larger models and multi-frame fusion methods provide the highest quality. Teams should benchmark candidate models on images from their deployment environment before committing to an architecture.
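A minimal benchmarking harness for that accuracy-latency comparison can be sketched as follows. This is an illustrative NumPy/stdlib example, not a specific model's API: the `dummy_model` callable is a placeholder standing in for a real forward pass (for example, a Depth Anything V2 inference call), and the warmup and run counts are arbitrary choices.

```python
import time
import numpy as np

def benchmark(model_fn, image, warmup=3, runs=20):
    """Measure throughput of a depth model on one representative image.

    model_fn: any callable mapping an HxWx3 uint8 image to a depth map.
    Returns frames per second over the timed runs.
    """
    for _ in range(warmup):           # warm caches/JIT before timing
        model_fn(image)
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(image)
    elapsed = time.perf_counter() - start
    return runs / elapsed

# Placeholder model: returns a zero depth map of the right shape.
def dummy_model(img):
    return np.zeros(img.shape[:2], dtype=np.float32)

# Benchmark on an image matching the deployment camera's resolution.
fps = benchmark(dummy_model, np.zeros((480, 640, 3), dtype=np.uint8))
```

Running the same harness with each candidate model on images captured in the deployment environment gives a like-for-like latency comparison on the target hardware.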

Metric calibration is necessary for any application requiring absolute distances. Relative depth models must be aligned to metric scale, typically by capturing a small calibration dataset (50-200 frames) with a known depth sensor and fitting a scale-and-shift transformation. ZoeDepth and Metric3D avoid this step by directly predicting metric depth, but their accuracy degrades on environments that differ significantly from their training distribution. For best results, Claru recommends fine-tuning metric heads on 500-2,000 frames from the target environment captured with a calibrated depth sensor.
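The scale-and-shift fit described above reduces to a two-parameter least-squares problem. The sketch below, assuming NumPy, fits depth-space values directly; in practice, alignment is often done in disparity (inverse-depth) space instead, but the fitting procedure is the same.

```python
import numpy as np

def fit_scale_shift(pred, gt, mask=None):
    """Least-squares scale s and shift t such that s * pred + t ≈ gt.

    pred: relative depth predictions (arbitrary monotone scale).
    gt:   metric depth from a calibrated sensor, same shape as pred.
    mask: optional boolean array marking valid sensor pixels.
    """
    p = pred[mask] if mask is not None else pred.ravel()
    g = gt[mask] if mask is not None else gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)   # design matrix [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t

# Synthetic sanity check: recover a known scale and shift.
rng = np.random.default_rng(0)
rel = rng.uniform(0.1, 1.0, size=(480, 640))     # fake relative depth
metric = 2.5 * rel + 0.3                         # fake sensor depth
s, t = fit_scale_shift(rel, metric)              # s ≈ 2.5, t ≈ 0.3
```

Over a 50-200 frame calibration set, the same fit is applied to all valid sensor pixels at once, and the masked variant skips pixels where the depth sensor returned no reading.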

Integration with downstream modules determines the practical value of MDE predictions. For obstacle avoidance, depth maps are thresholded to create binary free-space maps. For 3D reconstruction, per-pixel depth is unprojected to point clouds using camera intrinsics. For grasp planning, depth maps are fused with instance segmentation masks to estimate per-object 3D geometry. Each integration pathway has different accuracy requirements — navigation tolerates 10% error, while grasp planning requires sub-centimeter accuracy that typically demands sensor fusion (combining MDE with a depth camera or stereo pair) rather than MDE alone.
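The unprojection pathway mentioned above follows directly from the pinhole camera model. The sketch below assumes NumPy and hypothetical intrinsics for a 640x480 camera; real values come from the camera's calibration.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Unproject an HxW metric depth map to an Nx3 point cloud (camera frame).

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Hypothetical intrinsics; a flat wall 2 m in front of the camera.
depth = np.full((480, 640), 2.0)
pts = depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```

The resulting point cloud can then be thresholded for free-space mapping or intersected with segmentation masks for per-object geometry, as described above.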

Common Misconceptions

MYTH

Monocular depth estimation can fully replace depth sensors like LiDAR or RealSense.

FACT

MDE provides excellent relative depth and reasonable metric depth for scene understanding, navigation, and coarse spatial reasoning. However, active depth sensors still outperform MDE for applications requiring sub-centimeter metric accuracy: precision grasping, calibrated 3D scanning, and safety-critical obstacle detection. The practical approach is to use MDE as a complement to depth sensors — providing dense depth on surfaces where active sensors fail (glass, mirrors, dark surfaces) while relying on the sensor for metric accuracy where it succeeds.

MYTH

Foundation MDE models work perfectly out of the box on any domain.

FACT

Models like Depth Anything generalize remarkably well across domains but still exhibit systematic errors on environments that deviate from their training distribution. Underwater scenes, industrial close-ups, medical imaging, and unusual camera perspectives (ceiling-mounted, drone altitude) can produce significant depth errors. Fine-tuning on 500-2,000 domain-specific frames typically resolves these issues, reducing error by 30-50% compared to zero-shot inference.

MYTH

Higher resolution depth maps are always better for robotics.

FACT

Depth maps at native camera resolution (640x480 or 1280x720) contain more spatial detail but also amplify prediction noise, particularly along object boundaries and in textureless regions. For robotic navigation, downsampled depth maps (160x120 or 320x240) provide sufficient spatial resolution while averaging out per-pixel noise. For manipulation, full-resolution depth in the region of interest is valuable, but global full-resolution depth adds computation without benefit. Adaptive resolution — high in the workspace, low in the background — matches compute to information value.
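The noise-averaging effect of downsampling can be demonstrated with block averaging. This is a minimal NumPy sketch with synthetic Gaussian noise standing in for prediction error; real MDE noise is spatially correlated, so the reduction in practice is smaller than the idealized case below.

```python
import numpy as np

def downsample_depth(depth, factor):
    """Downsample an HxW depth map by block-averaging.

    Averaging factor x factor blocks suppresses independent per-pixel
    noise; H and W must be divisible by factor.
    """
    h, w = depth.shape
    blocks = depth.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(1)
noisy = 2.0 + rng.normal(0.0, 0.05, size=(480, 640))  # ~5 cm noise
coarse = downsample_depth(noisy, 4)                   # 120x160 map
# Averaging 16 independent samples per cell cuts noise std roughly 4x.
```

For the adaptive-resolution scheme, the same function can be applied with a larger factor outside the workspace region and a smaller one (or none) inside it.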

Key Papers

  1. Yang et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024.
  2. Ranftl et al. "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer." TPAMI 2020.
  3. Bhat et al. "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth." arXiv:2302.12288, 2023.
  4. Ranftl et al. "Vision Transformers for Dense Prediction." ICCV 2021.
  5. Godard et al. "Digging into Self-Supervised Monocular Depth Estimation." ICCV 2019.

How Claru Supports This

Claru captures the paired RGB-depth training data that monocular depth estimation models need for domain-specific fine-tuning and metric calibration. Our data collection network uses calibrated Intel RealSense and Azure Kinect sensors across 100+ cities, producing synchronized RGB-depth frames in the diverse real-world environments — homes, warehouses, retail floors, outdoor spaces — where robots actually operate. Each frame includes camera intrinsics, depth sensor specifications, and environment metadata. For teams deploying Depth Anything or ZoeDepth, Claru provides the 500-5,000 frame calibration datasets that convert general-purpose MDE into metric depth with deployment-specific accuracy. Our enrichment pipeline also runs Depth Anything V2 on existing RGB-only video in our 386,000+ clip catalog, producing pseudo-depth annotations that extend dataset utility for teams whose downstream models consume depth features.

Frequently Asked Questions

What is the difference between relative and metric depth estimation?

Relative depth estimation predicts the ordinal depth relationships between pixels — which surfaces are closer or farther — but not their absolute distances in meters. The output is a disparity or inverse-depth map with arbitrary scale. Models like MiDaS and early Depth Anything produce relative depth, useful for tasks like image editing, visual effects, and scene understanding where exact distances are not needed. Metric depth estimation predicts absolute distances in real-world units (meters), enabling direct use in robotics, autonomous driving, and 3D reconstruction. Metric depth requires either training on datasets with ground truth depth from LiDAR or structured light sensors, or applying a scale recovery step that calibrates relative depth to known object sizes or camera parameters. ZoeDepth and Metric3D bridge this gap by predicting metric depth from monocular images using learned scale priors.

What training data does monocular depth estimation require?

MDE training data consists of RGB images paired with per-pixel depth maps. Ground truth depth is captured using LiDAR sensors (outdoor scenes, 10-100m range, sparse), structured light sensors like Intel RealSense (indoor, 0.3-10m, dense), or time-of-flight cameras (indoor/outdoor, moderate range and density). Major training datasets include NYU Depth V2 (roughly 407K raw indoor Kinect frames across 464 scenes), KITTI (93K outdoor driving frames from Velodyne LiDAR), and the large-scale MegaDepth and Taskonomy datasets that use multi-view stereo to compute dense depth from internet photos. Foundation models like Depth Anything V2 train on a mix of 595K labeled images from diverse sources plus 62 million unlabeled images with pseudo-depth labels generated by a teacher model. For robotics-specific MDE, training data from the target deployment environment — captured with the exact camera the robot will use — is critical for achieving metric accuracy.

How accurate is monocular depth estimation?

Accuracy depends on the model, domain, and whether relative or metric depth is evaluated. On the NYU Depth V2 indoor benchmark, state-of-the-art models achieve absolute relative error (AbsRel) of 0.05-0.06 and delta-1 accuracy (percentage of pixels where the predicted/true depth ratio is within 1.25) above 98%. On KITTI outdoor driving scenes, AbsRel is 0.05-0.06 with delta-1 above 97%. However, these benchmarks represent in-distribution performance. On novel environments not seen during training, errors can increase 2-5x. For robotic manipulation at tabletop scale (0.3-2m), well-calibrated metric depth models achieve 2-5cm absolute error, sufficient for coarse collision avoidance and scene understanding but not precise enough for grasp planning without additional sensing. For navigation, 5-10% relative error is generally sufficient for obstacle avoidance.
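The two benchmark metrics, AbsRel and delta-1, are straightforward to compute. A minimal NumPy sketch, with a synthetic prediction that overestimates depth uniformly by 5%:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel and delta-1 as reported on NYU/KITTI benchmarks.

    AbsRel  = mean(|pred - gt| / gt)
    delta-1 = fraction of pixels where max(pred/gt, gt/pred) < 1.25
    """
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return abs_rel, delta1

gt = np.full((480, 640), 2.0)          # ground truth: flat 2 m scene
pred = gt * 1.05                       # uniform 5% overestimate
abs_rel, delta1 = depth_metrics(pred, gt)  # abs_rel ≈ 0.05, delta1 = 1.0
```

In benchmark practice, invalid ground-truth pixels (missing sensor returns) are masked out before computing either metric.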

Why is Depth Anything significant for robotics?

Depth Anything V1 and V2 (Yang et al., 2024) represent a paradigm shift toward generalizable depth estimation. By training on 62 million diverse images with a semi-supervised teacher-student framework, Depth Anything produces high-quality relative depth maps on virtually any input image without domain-specific fine-tuning. For robotics teams, this means MDE is now a commodity perception module rather than a custom model that must be trained per environment. Depth Anything V2 further introduces metric depth heads fine-tuned on specific domains (indoor, outdoor, driving), providing absolute depth without a separate calibration step. The practical impact is that teams can add depth perception to any RGB camera by running Depth Anything inference, then optionally fine-tune on domain-specific data for metric accuracy.

How does Claru support monocular depth estimation?

Claru captures paired RGB-depth data for MDE training and evaluation using calibrated depth sensors (Intel RealSense D455, Azure Kinect) across diverse real-world environments. Our data collection network operates in 100+ cities, capturing indoor and outdoor scenes with the environmental variety — lighting conditions, surface materials, scene layouts — that determines whether MDE models generalize beyond controlled lab settings. Each RGB-depth pair includes camera intrinsic calibration, depth sensor noise characterization, and environment metadata (indoor/outdoor, scene type, distance range). For teams fine-tuning Depth Anything or ZoeDepth on domain-specific data, Claru provides the labeled calibration datasets (500-5,000 frames per environment type) that convert general-purpose MDE into deployment-ready metric depth perception.

Need Depth Estimation Training Data?

Claru provides purpose-built datasets for physical AI and robotics. Tell us what your model needs to learn.