Depth Anything V2: Definition, Applications, and Training Data Requirements

Depth Anything V2 is a state-of-the-art monocular depth estimation model that predicts dense, per-pixel depth maps from a single RGB image. Built on a DINOv2 Vision Transformer backbone, it achieves strong zero-shot generalization across domains. This page covers the model architecture, training data strategy, robotics applications, and how depth pseudo-labels enrich video datasets.

What Is Depth Anything V2?

Depth Anything V2 is a monocular depth estimation model that predicts a dense, per-pixel depth map from a single RGB image. Developed by researchers at the University of Hong Kong and TikTok, it builds on the Dense Prediction Transformer (DPT) architecture with a DINOv2 Vision Transformer backbone. The model takes an arbitrary-resolution RGB image as input and outputs a depth map of the same spatial resolution, where each pixel value represents the estimated distance from the camera to the corresponding scene surface.

The architecture follows the encoder-decoder pattern standard in dense prediction tasks. The encoder is a frozen or fine-tuned DINOv2 ViT (available in Small, Base, Large, and Giant variants), which extracts multi-scale feature maps from the input image. The decoder is a DPT head that fuses these multi-scale features through progressive upsampling and fusion modules to produce the final depth prediction. The model is trained with a combination of supervised loss on labeled depth data and self-training loss on pseudo-labeled unlabeled data, using a scale-and-shift-invariant loss that makes training robust to the varying depth ranges and scales present in heterogeneous training datasets.
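
The scale-and-shift-invariant idea can be made concrete with a short sketch. The code below is a minimal illustration of one common formulation (least-squares affine alignment before the error term), not the authors' implementation; the array names and the mean-absolute-error choice are assumptions for the example.

```python
import numpy as np

def scale_shift_invariant_mae(pred, gt, valid_mask):
    """Align pred to gt with a least-squares scale and shift, then compute MAE.

    pred, gt: (H, W) float arrays (e.g. disparity / inverse depth).
    valid_mask: (H, W) bool array marking pixels that have ground truth.
    This affine alignment is what lets datasets with incompatible depth
    scales and offsets be mixed during training.
    """
    p = pred[valid_mask]
    g = gt[valid_mask]
    # Solve min_{s,t} || s * p + t - g ||^2 in closed form.
    A = np.stack([p, np.ones_like(p)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.mean(np.abs(s * p + t - g))
```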

A defining innovation of Depth Anything V2 relative to its predecessor is the shift from noisy real-world labeled data to high-quality synthetic data for the supervised training component. V1 used 1.5 million images from existing depth datasets (NYUv2, KITTI, HRWSI, etc.) with depth ground truth from various sensors, each with different noise characteristics, missing regions, and depth ranges. V2 replaced this with 595,000 synthetic images with pixel-perfect depth rendered from virtual 3D environments. The authors found that this smaller but higher-quality labeled set produced substantially better depth predictions, because the model learned cleaner depth features from noise-free supervision. The large-scale unlabeled component (62 million real images with pseudo-depth labels from a teacher model) remained, providing the visual diversity needed for zero-shot generalization.

For robotics and physical AI, Depth Anything V2 has become a standard tool for data enrichment. Egocentric video datasets collected with monocular cameras lack depth information, limiting their utility for training manipulation policies that operate in 3D. Running Depth Anything V2 on every frame of these datasets produces dense depth pseudo-labels that enable 3D reasoning without requiring depth sensors during collection. The model's zero-shot generalization means it works across the diverse environments in robotics datasets — kitchens, warehouses, outdoor scenes — without domain-specific fine-tuning. This makes it practical as a batch processing tool for enriching large video catalogs.
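
As a concrete starting point for per-frame processing, the model can be run through the Hugging Face transformers depth-estimation pipeline. This is a minimal sketch; the checkpoint identifier below is an assumption (a published Depth Anything V2 Small checkpoint on the Hugging Face Hub), so substitute whichever variant and weights your pipeline actually deploys.

```python
from transformers import pipeline
from PIL import Image

# Assumed checkpoint id; swap in the Depth Anything V2 variant you use.
depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("frame_000123.jpg")  # hypothetical frame path
result = depth_estimator(image)

# result["depth"] is a visualization image; result["predicted_depth"] is the
# raw relative-depth tensor, which is what a data-enrichment pipeline stores.
relative_depth = result["predicted_depth"].squeeze().numpy()
print(relative_depth.shape, relative_depth.min(), relative_depth.max())
```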

The model is available in four sizes: ViT-S (24.8M parameters, fastest inference), ViT-B (97.5M), ViT-L (335.3M), and ViT-G (1.3B, highest accuracy). For robotics deployment where inference speed matters, the ViT-S and ViT-B variants run at 50+ FPS on consumer GPUs. For offline data enrichment where quality matters more than speed, the ViT-L variant provides the best accuracy-efficiency tradeoff.

Historical Context

Monocular depth estimation — predicting 3D depth from a single 2D image — has been studied since Saxena et al. (2005), whose early work using hand-crafted features and Markov Random Fields to predict depth from single outdoor images was later extended into Make3D. The field progressed through the CNN era with Eigen et al. (2014) demonstrating that convolutional neural networks could directly regress depth from pixels, achieving results that surpassed prior hand-crafted methods.

The field's major inflection point came with MiDaS (Ranftl et al., 2020), which introduced the concept of training on a diverse mixture of depth datasets with an affine-invariant loss function. Previous models trained on single datasets (typically NYUv2 for indoor or KITTI for outdoor) generalized poorly across domains. MiDaS showed that combining multiple datasets with different depth ranges, sensors, and domains — while using a loss function invariant to scale and shift differences between datasets — produced a model that generalized zero-shot to entirely new domains. This "train on everything, test on anything" paradigm became the foundation for all subsequent work.

ZoeDepth (Bhat et al., 2023) extended MiDaS by adding metric depth prediction capability, combining a relative depth backbone with a domain-specific metric head. UniDepth (Piccinelli et al., 2024) further improved cross-domain metric depth by jointly predicting depth and camera intrinsics.

Depth Anything V1 (Yang et al., January 2024) achieved a step-function improvement by combining a DINOv2 backbone with massive-scale self-training on 62 million unlabeled images. The DINOv2 encoder provided superior visual features compared to the ResNet and BEiT backbones used by MiDaS and ZoeDepth, and the self-training on unlabeled data provided scale that labeled datasets alone could not match. Depth Anything V2 (Yang et al., June 2024) refined this approach with the insight that synthetic labeled data outperforms real labeled data for the supervised component, further improving zero-shot performance.

The Depth Anything Video extension (2024) added temporal consistency to the model, producing smooth depth maps across video frames rather than treating each frame independently — addressing the temporal flickering artifacts that plagued per-frame depth estimation and making the model directly useful for video data enrichment pipelines.

Practical Implications

For robotics teams, Depth Anything V2 has three primary practical applications: real-time depth for monocular robots, offline data enrichment for video datasets, and depth supervision for training downstream models.

As a real-time depth source, the ViT-S variant runs at 50+ FPS on an NVIDIA RTX 3090, making it fast enough for closed-loop control. However, the depth is relative, not metric, which limits direct use in manipulation planning that requires metric coordinates. The standard integration is to use Depth Anything V2 for object segmentation via depth discontinuities (objects at different depths are easily separated) and obstacle detection (anything closer than a threshold), while relying on hardware depth sensors or stereo cameras for precise metric measurements. For mobile robots with only monocular cameras, the metric depth variants fine-tuned on indoor or outdoor data provide usable metric estimates with approximately 5-10% relative error.
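
The "anything closer than a threshold" pattern is easy to sketch on relative depth. Because the base model's output carries no metric scale, the cutoff below is defined per frame as a percentile of the frame's own depth distribution; the percentile value and the inverse-depth convention are assumptions for this illustration.

```python
import numpy as np

def near_obstacle_mask(relative_depth, near_percentile=90.0):
    """Flag pixels likely to be close to the camera in a relative depth map.

    relative_depth: (H, W) array where larger values mean closer (the
    inverse-depth convention of Depth Anything outputs). Since the scale is
    relative, the cutoff is a per-frame percentile rather than meters.
    """
    cutoff = np.percentile(relative_depth, near_percentile)
    return relative_depth >= cutoff

# Example: treat the nearest ~10% of the scene as a potential obstacle region.
# mask = near_obstacle_mask(pred_depth, near_percentile=90.0)
```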

For offline data enrichment, Depth Anything V2 transforms monocular RGB video datasets into RGB-D datasets. This is how Claru enriches our egocentric video catalog: every frame is processed with the ViT-L variant to produce a corresponding depth map, stored as 16-bit PNG with a documented scale factor. These depth-enriched datasets enable downstream applications that would otherwise require depth sensors during collection: training world models with 3D awareness, generating point clouds for spatial reasoning, computing surface normals for grasp planning, and creating 3D reconstructions from video.
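
A sketch of the storage step described above, assuming the predicted depth arrives as a float array and using OpenCV to write the 16-bit PNG. The scale-factor convention here (map the per-clip min/max into the 16-bit range and record it in a sidecar file) is one reasonable choice for illustration, not a fixed standard.

```python
import json
import numpy as np
import cv2

def save_depth_png16(depth, png_path, meta_path):
    """Quantize a float depth map to 16-bit PNG and record the scale factor."""
    d_min, d_max = float(depth.min()), float(depth.max())
    scale = 65535.0 / (d_max - d_min + 1e-8)
    depth_u16 = np.round((depth - d_min) * scale).astype(np.uint16)
    cv2.imwrite(png_path, depth_u16)  # uint16 array + .png => lossless 16-bit PNG
    with open(meta_path, "w") as f:
        json.dump({"d_min": d_min, "d_max": d_max, "scale": scale}, f)

def load_depth_png16(png_path, meta_path):
    """Invert the quantization using the stored scale factor."""
    with open(meta_path) as f:
        meta = json.load(f)
    depth_u16 = cv2.imread(png_path, cv2.IMREAD_UNCHANGED).astype(np.float32)
    return depth_u16 / meta["scale"] + meta["d_min"]
```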

Key practical considerations for depth enrichment: the model should be run on the original resolution frames or downsampled to at most 518x518 (the training resolution) — aggressive downsampling degrades depth quality at object boundaries. For temporal consistency in video, applying Depth Anything Video or simple temporal filtering (weighted average across adjacent frames) eliminates flickering. The output depth maps should be stored in a lossless format (16-bit PNG) rather than 8-bit JPEG, because 8-bit quantization introduces visible banding artifacts in depth gradients that degrade downstream use. Processing throughput on a single A100 GPU with the ViT-L model is approximately 800 frames per minute at 518x518 resolution.
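
The "weighted average across adjacent frames" mentioned above can be as simple as the exponential moving average sketched below. The smoothing weight and array shapes are illustrative; re-aligning each frame's scale and shift before averaging is a further refinement not shown here.

```python
import numpy as np

def smooth_depth_sequence(depth_frames, alpha=0.7):
    """Exponential moving average over per-frame depth maps.

    depth_frames: iterable of (H, W) float arrays in temporal order.
    alpha: weight on the current frame; lower values smooth more aggressively.
    Returns a list of temporally smoothed depth maps of the same shape.
    """
    smoothed, running = [], None
    for depth in depth_frames:
        running = depth if running is None else alpha * depth + (1 - alpha) * running
        smoothed.append(running.copy())
    return smoothed
```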

Claru uses Depth Anything V2 as part of our multi-model enrichment pipeline, alongside optical flow (RAFT), pose estimation (ViTPose), and vision-language captioning. Every video clip in our catalog receives depth pseudo-labels, enabling clients to access RGB-D data without requiring depth sensors during the original collection.

Common Misconceptions

MYTH

Monocular depth models like Depth Anything V2 can replace hardware depth sensors for robot manipulation.

FACT

Monocular depth estimation produces relative depth with approximately 5-15% error on the metric depth variants and no metric scale on the base model. Hardware depth sensors (Intel RealSense, Stereolabs ZED) provide metric depth with 1-3% error and explicit scale. For precision manipulation tasks — inserting pegs, tightening screws, surgical robotics — hardware depth sensors remain necessary. Depth Anything V2 is most valuable as a complement: filling in depth where hardware sensors fail (transparent objects, very close range, outdoor bright sunlight), enriching monocular datasets with depth pseudo-labels, and providing depth for mobile robots where adding depth hardware is impractical.
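
One way to use monocular depth as a complement rather than a replacement is sketched below: fit a scale and shift that maps the model's relative output onto the hardware sensor's valid pixels, then use the aligned prediction only where the sensor returned no reading. The fit is done in inverse depth and the variable names are illustrative assumptions.

```python
import numpy as np

def fill_sensor_holes(relative_pred, sensor_depth_m):
    """Fill missing hardware-depth pixels with scale-aligned monocular depth.

    relative_pred: (H, W) model output (inverse-depth convention, no scale).
    sensor_depth_m: (H, W) metric depth in meters, 0 where the sensor failed.
    Fits s, t so that s * relative_pred + t approximates the sensor's inverse
    depth on valid pixels, then converts back to meters in the holes only.
    """
    valid = sensor_depth_m > 0
    inv_sensor = 1.0 / sensor_depth_m[valid]
    A = np.stack([relative_pred[valid], np.ones(int(valid.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, inv_sensor, rcond=None)
    inv_filled = np.clip(s * relative_pred + t, 1e-6, None)
    fused = sensor_depth_m.copy()
    fused[~valid] = 1.0 / inv_filled[~valid]
    return fused
```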

MYTH

Depth Anything V2 works equally well on all types of images, including robotics edge cases.

FACT

While Depth Anything V2 has strong zero-shot generalization, it has known failure modes relevant to robotics. Transparent and reflective objects produce unreliable depth because the model has limited training data for these materials. Textureless surfaces like white walls can produce smooth depth gradients that miss actual depth discontinuities. Close-range egocentric views of hands manipulating objects, which are central to robotics datasets, were underrepresented in the training data and show weaker performance than broader scene views. For production robotics, these edge cases should be validated on domain-specific test images before relying on the model's depth predictions.

MYTH

The V2 model is strictly better than V1 in all scenarios.

FACT

Depth Anything V2 achieves higher accuracy on standard benchmarks, but V1 can produce smoother depth maps in some scenarios because it was trained with noisy real-world depth labels that implicitly taught the model to produce conservative, smooth predictions. V2's sharper predictions (from clean synthetic training data) sometimes produce artifacts at depth discontinuities that V1 avoids. For applications where smooth depth gradients matter more than sharp boundaries (e.g., navigation obstacle maps), V1 may be preferable. For applications where precise object boundaries matter (e.g., grasping, segmentation), V2 is clearly superior.

Key Papers

  1. Yang et al. "Depth Anything V2." NeurIPS, 2024.
  2. Yang et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR, 2024.
  3. Ranftl et al. "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer." TPAMI, 2020.
  4. Bhat et al. "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth." arXiv, 2023.
  5. Oquab et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR, 2024.

How Claru Supports This

Claru uses Depth Anything V2 as a core component of our multi-model video enrichment pipeline. Every clip in our catalog of 3M+ annotations is processed with the ViT-L variant to produce dense per-frame depth maps, stored as 16-bit lossless PNGs with documented scale factors. This transforms our monocular egocentric video collections into RGB-D datasets that support 3D scene understanding, point cloud generation, and surface normal computation without requiring depth sensors during the original collection.

For clients training world models, manipulation policies, or navigation systems that require depth input channels, Claru delivers depth-enriched datasets ready for immediate use. Our depth enrichment pipeline includes temporal consistency filtering to eliminate inter-frame flickering, quality auditing to flag known failure cases (transparent objects, close-range hand views), and calibration against hardware depth sensors where available to provide approximate metric scale factors. Combined with our optical flow (RAFT), pose estimation (ViTPose), and vision-language captioning enrichments, depth maps complete a comprehensive per-frame annotation layer that maximizes the training signal extractable from each video clip.

Frequently Asked Questions

How does Depth Anything V2 differ from Depth Anything V1?

Depth Anything V1 (Yang et al., January 2024) trained a DPT (Dense Prediction Transformer) head on top of a DINOv2 encoder using 62 million unlabeled images with pseudo-depth labels generated by a large teacher model, plus 1.5 million labeled images from existing depth datasets. V2 (Yang et al., June 2024) improved upon V1 in three key ways. First, it replaced the 1.5 million labeled real images with 595,000 high-quality synthetic images rendered from precise 3D scenes, finding that synthetic data with pixel-perfect depth ground truth produced better results than noisy sensor labels from real images. Second, it introduced a more rigorous teacher-student distillation protocol for large-scale unlabeled data. Third, it offered metric depth variants fine-tuned on indoor and outdoor benchmarks, whereas V1 only produced relative depth. V2 achieves state-of-the-art zero-shot relative depth estimation across NYUv2, KITTI, ETH3D, ScanNet, and DIODE benchmarks.

How is Depth Anything V2 used in robotics?

In robotics, Depth Anything V2 serves as a pseudo-sensor that adds depth information to any RGB camera stream. For mobile robots with only monocular cameras, it enables 3D scene understanding without depth hardware. For manipulation systems, it provides dense depth maps that complement sparse depth from structured light sensors in regions where those sensors fail (transparent objects, shiny surfaces, thin structures). Common integration patterns include: using predicted depth maps as an input channel to grasp planning networks, generating point clouds from RGB plus predicted depth for 6-DoF pose estimation, enriching egocentric video datasets with per-frame depth maps for training world models, and providing depth supervision signals for training NeRFs or 3D Gaussian splatting from video. The model runs at approximately 30 FPS on an NVIDIA A100 for the ViT-L variant at 518x518 resolution, making it practical for real-time robotics applications.
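
As an illustration of the "RGB plus predicted depth to point cloud" pattern, the pinhole back-projection below assumes known camera intrinsics (fx, fy, cx, cy) and a depth map already expressed in meters; if only relative depth is available, the resulting cloud is correct only up to scale and shift.

```python
import numpy as np

def depth_to_point_cloud(depth_m, fx, fy, cx, cy):
    """Back-project a metric depth map into an (N, 3) camera-frame point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid / zero-depth pixels
```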

Does Depth Anything V2 produce metric or relative depth?

The base Depth Anything V2 model produces relative (affine-invariant) depth — it correctly orders surfaces from near to far and preserves relative depth ratios, but the absolute scale is undefined. This means two objects predicted at depths 0.3 and 0.6 are correctly identified as having a 2:1 depth ratio, but the actual distances in meters are unknown. For robotics applications requiring metric depth, the authors released fine-tuned variants: Depth Anything V2 Metric trained on NYUv2 (indoor, 0-10m range) and KITTI (outdoor, 0-80m range). These produce depth in meters but are domain-specific. For cross-domain metric depth, the common practice is to calibrate relative depth predictions using known reference measurements — a single known object dimension or camera height provides the scale factor to convert relative depth to metric.
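
A sketch of that reference-measurement calibration: given one pixel whose true distance is known (a fiducial on the table, a measured camera height), the ratio between the known distance and the model's value at that pixel gives the scale factor. This assumes the prediction is depth-like and ambiguous only in scale; if an unknown shift remains as well, a second reference measurement is needed. Names and values below are illustrative.

```python
import numpy as np

def calibrate_scale(relative_depth, ref_pixel, ref_distance_m):
    """Convert relative depth to approximate metric depth via one reference point.

    Assumes the map is known only up to a single scale factor; with a full
    scale-and-shift ambiguity, two reference points are required instead.
    """
    u, v = ref_pixel  # pixel whose true distance from the camera is known
    scale = ref_distance_m / relative_depth[v, u]
    return relative_depth * scale

# Hypothetical usage: a marker 1.2 m from the camera observed at pixel (320, 240).
# metric_depth = calibrate_scale(pred_depth, ref_pixel=(320, 240), ref_distance_m=1.2)
```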

What training data does Depth Anything V2 use?

Depth Anything V2's training data has three components. First, 595,000 synthetic images with precise depth ground truth rendered from 3D virtual environments — this replaced the 1.5 million labeled real images used in V1 because synthetic depth is pixel-perfect while real depth sensor data has noise, missing values, and limited range. Second, approximately 62 million unlabeled real images from diverse internet sources, used for self-training via pseudo-labels from a large teacher model. Third, standard depth estimation benchmarks (NYUv2, KITTI) for the metric depth fine-tuned variants. The key insight was that a smaller amount of high-quality synthetic depth data outperformed a larger amount of noisy real depth data as the labeled component, while the massive unlabeled real data provided the visual diversity needed for generalization.

How does Depth Anything V2 compare to MiDaS and ZoeDepth?

MiDaS (Ranftl et al., 2020, updated through MiDaS v3.1) was the pioneering robust monocular depth model, trained on a mix of 12 depth datasets with an affine-invariant loss. It established the paradigm of training on diverse depth data for zero-shot transfer. ZoeDepth (Bhat et al., 2023) built on MiDaS by adding a metric depth head that combines relative depth with metric bin predictions, achieving strong metric depth accuracy. Depth Anything V2 surpasses both on zero-shot benchmarks by leveraging the DINOv2 backbone's superior visual representations and by using synthetic data for the supervised component. On the NYUv2 benchmark, Depth Anything V2 ViT-L achieves a delta-1 accuracy of 0.982 compared to ZoeDepth's 0.955 and MiDaS v3.1's 0.918. For robotics teams, the practical difference is that Depth Anything V2 produces sharper depth boundaries around objects, which is critical for manipulation where precise object-background separation determines grasp success.

Need Depth-Enriched Video Data?

Claru runs Depth Anything V2 across our video catalog, delivering RGB datasets enriched with dense per-frame depth maps for robotics perception and world model training.