Depth Estimation
Predicting the distance of objects from the camera to understand 3D scene structure
Depth estimation is the task of predicting the distance from the camera to every pixel or point in an image, creating a depth map that represents the 3D structure of the scene. It's a fundamental computer vision task that enables machines to understand spatial relationships and 3D geometry from 2D images.
📚 Training Depth Estimation Models
Looking to train depth estimation models? Check out our comprehensive Depth Estimation Training Guide with detailed parameter documentation for all available models and training techniques.
What is Depth Estimation?
Depth estimation takes a 2D image as input and produces a depth map — an image where each pixel value represents the distance from the camera to the corresponding point in the scene. Brighter pixels typically indicate objects closer to the camera, while darker pixels represent farther objects (or vice versa, depending on representation).
Examples:
- A photo of a street scene → depth map showing cars nearby and buildings far away
- Indoor room image → depth values for furniture, walls, and floor at different distances
- Portrait photo → depth map distinguishing person from background
- Landscape image → depth gradients from foreground to distant mountains
Applications: Autonomous driving, robotics, AR/VR, 3D reconstruction, photo effects, and accessibility tools.
Key Concepts
Monocular vs. Stereo Depth Estimation
Monocular depth estimation:
- Uses a single image as input
- More challenging: lacks explicit stereo cues
- Relies on learned priors about object sizes, perspective, occlusion
- More practical: works with any camera, including existing photos
- This is the focus of most modern deep learning approaches
Stereo depth estimation:
- Uses two images from different viewpoints (like human eyes)
- Triangulates depth through disparity between views
- More geometrically grounded
- Requires calibrated stereo camera setup
- Classical approaches well-established
Multi-view depth estimation:
- Uses multiple images from different angles
- Structure-from-Motion (SfM) techniques
- More accurate but requires multiple captures
Depth Maps
The output of depth estimation — a 2D map where each pixel encodes depth information:
Representation formats:
- Inverse depth: 1/d (common in learning-based methods)
- Disparity: Related to depth in stereo vision
- Metric depth: Actual distance in meters
- Relative depth: Ordinal relationships without absolute scale
Visualization:
- Typically shown as grayscale images
- Colormaps (Viridis, Plasma, Turbo) for better perception
- Lighter/warmer colors for near, darker/cooler for far (or inverted)
Resolution: Usually matches input image resolution, though some methods predict at different scales.
Relative vs. Absolute Depth
Relative (ordinal) depth:
- Predicts depth ordering: which objects are closer or farther
- No absolute scale (one scene's "10" might be another's "100")
- Sufficient for many applications (photo effects, occlusion reasoning)
- Easier to learn: consistent across different scenes
- Most general-purpose models predict relative depth
Absolute (metric) depth:
- Predicts actual distances in real-world units (meters)
- Requires training data with ground-truth metric measurements
- Scene-specific scale factor
- Essential for robotics, autonomous driving, AR
- Harder to generalize across domains
Scale ambiguity: Monocular depth estimation fundamentally cannot determine absolute scale without additional information (like known object sizes).
Disparity
In stereo vision, disparity is the difference in image location of an object when viewed from different positions:
Relationship to depth:

Z = (f * B) / d

where:
- Z = depth
- f = focal length
- B = baseline (distance between cameras)
- d = disparity
Inverse relationship: Closer objects have larger disparity, farther objects smaller.
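The relationship Z = f * B / d is straightforward to apply per pixel; here is a minimal NumPy sketch (the focal length and baseline values are illustrative, roughly KITTI-like):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to metric depth (meters): Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)  # zero disparity -> infinitely far
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative stereo rig: focal length in pixels, baseline in meters
f, B = 721.5, 0.54
disp = np.array([[64.0, 32.0], [16.0, 0.0]])
print(disparity_to_depth(disp, f, B))
```

Note the inverse relationship in the output: halving the disparity doubles the depth, and zero disparity (no shift between views) corresponds to a point at infinity.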
Scale Ambiguity
Fundamental challenge in monocular depth estimation:
Problem: From a single image, a small nearby object looks identical to a large far-away object.
Example: A toy car close to the camera vs. a real car far away can produce the same image.
Implications:
- Cannot recover absolute metric depth without additional cues
- Models learn statistical priors about typical object sizes
- Scale may vary between scenes
- Fine-tuning on domain-specific data can improve scale consistency
Solutions:
- Multi-view methods (structure-from-motion)
- Known reference objects in scene
- Sensor fusion (camera + LiDAR)
- Domain-specific training (e.g., only indoor or only outdoor)
Approaches and Architectures
CNN-Based Methods
Convolutional Neural Networks for dense depth prediction:
Early approaches:
- Eigen et al. (2014): Multi-scale architecture, coarse-to-fine prediction
- FCRN (Fully Convolutional Residual Networks): Up-convolution for resolution recovery
- Typically used encoder-decoder architectures
MiDaS family (Mixed Data Strategy):
- MiDaS v2: ResNet or ResNeXt encoder, multi-scale decoder
- MiDaS v3: Combines multiple datasets with affine-invariant loss
- MiDaS v3.1: Adds smaller efficient models (DPT-Hybrid)
- Trained on diverse datasets for generalization
- Predicts relative depth (robust across domains)
- State-of-the-art zero-shot performance
Key techniques:
- Skip connections from encoder to decoder
- Multi-scale feature fusion
- Up-sampling strategies (transpose convolutions, bilinear interpolation)
Transformer-Based Methods
Vision Transformers applied to depth estimation:
DPT (Dense Prediction Transformer):
- Vision Transformer (ViT) backbone
- Reassembles tokens at multiple scales
- Convolutional decoder for dense predictions
- Better global context understanding than CNNs
- Included in MiDaS v3 family
Depth Anything (2024):
- Large-scale foundation model approach
- Trained on massive unlabeled data with pseudo-labels
- Strong zero-shot generalization
- Versions: Small, Base, Large
- Excellent fine-grained detail and edge preservation
DepthFormer:
- Transformer encoder with hierarchical features
- Efficient attention mechanisms
- Competitive accuracy with lower compute
Advantages of Transformers:
- Better long-range dependencies
- More effective at capturing global scene context
- Superior performance with sufficient data
- Trade-off: higher computational cost than CNNs
Self-Supervised Learning
Training depth models without ground-truth depth labels:
Core idea: Use stereo pairs or video sequences and enforce geometric consistency.
Monodepth2:
- Predicts depth from monocular video
- Loss based on photometric reprojection error
- Learns from temporal consistency
- No explicit depth supervision needed
Process:
- Predict depth for frame t
- Predict camera pose between frames t and t+1
- Warp frame t+1 to frame t using predicted depth and pose
- Minimize reconstruction error between original and warped frame
Formula:

L = || I_t - Î_t ||

where Î_t is the warped image (frame t+1 warped into frame t using the predicted depth and pose).
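The four steps above can be sketched for a pinhole camera in NumPy. This is a minimal illustration with nearest-neighbor sampling and a plain L1 error; a real implementation such as Monodepth2 uses differentiable bilinear sampling and an SSIM + L1 loss, so treat this as a sketch of the geometry, not the actual training loss:

```python
import numpy as np

def photometric_error(img_target, img_source, depth, K, R, t):
    """Warp img_source into the target view using depth + pose, return mean L1 error.

    img_*: (H, W) grayscale images, depth: (H, W) depth for the target view,
    K: (3, 3) intrinsics, (R, t): source-from-target rotation/translation.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # homogeneous pixel coords

    # 1) Unproject target pixels to 3D using the predicted depth
    pts = np.linalg.inv(K) @ pix * depth.ravel()
    # 2) Transform into the source camera frame with the predicted pose
    pts_src = R @ pts + t[:, None]
    # 3) Project into the source image plane
    proj = K @ pts_src
    u_s = np.round(proj[0] / proj[2]).astype(int)
    v_s = np.round(proj[1] / proj[2]).astype(int)
    valid = (u_s >= 0) & (u_s < W) & (v_s >= 0) & (v_s < H) & (proj[2] > 0)

    # 4) Sample the source image and compare (photometric reprojection error)
    warped = np.zeros(H * W)
    warped[valid] = img_source[v_s[valid], u_s[valid]]
    return np.abs(img_target.ravel()[valid] - warped[valid]).mean()
```

With an identity pose and identical frames the error is zero regardless of the predicted depth, which is exactly why moving objects and textureless regions (discussed below) provide such a weak training signal.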
Benefits:
- No expensive depth annotations needed
- Can leverage abundant video data
- Learns from real-world geometric constraints
Challenges:
- Moving objects violate static scene assumption
- Textureless regions provide little supervision
- Scale ambiguity remains
Multi-Task Learning
Learning depth jointly with related tasks:
Common combinations:
- Depth + Semantic Segmentation: Shared features benefit both tasks
- Depth + Surface Normals: Geometric consistency
- Depth + Optical Flow: Motion understanding
Benefits:
- Improved generalization through shared representations
- Mutual regularization between tasks
- More efficient use of training data
Example architecture:
- Shared encoder
- Task-specific decoder heads
- Multi-task loss: L_total = Σᵢ λᵢ Lᵢ (weighted sum of per-task losses)
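A weighted multi-task loss is just a λ-weighted sum over the per-task terms; a minimal sketch (the task names and weight values are illustrative):

```python
def multi_task_loss(task_losses, weights):
    """L_total = sum_i lambda_i * L_i over the task-specific losses."""
    return sum(weights[name] * loss for name, loss in task_losses.items())

# Illustrative scalar losses and lambda weights for a depth + segmentation model
losses = {"depth": 1.0, "segmentation": 2.0}
weights = {"depth": 1.0, "segmentation": 0.5}
print(multi_task_loss(losses, weights))  # 1.0*1.0 + 2.0*0.5 = 2.0
```

In practice the weights are either tuned by hand or learned (e.g., via task-uncertainty weighting), since a poorly balanced loss lets one task dominate the shared encoder.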
Evaluation Metrics
Absolute Relative Error (Abs Rel)
Measures relative depth error averaged over all pixels:

AbsRel = (1/N) Σᵢ |d_i - d̂_i| / d_i

where d_i is the ground-truth depth and d̂_i is the predicted depth for pixel i.
Interpretation:
- Lower is better (0 = perfect prediction)
- Scale-independent: works for relative and absolute depth
- Emphasizes relative accuracy rather than absolute values
- Commonly reported metric
Root Mean Squared Error (RMSE)
Standard metric for prediction error:

RMSE = sqrt( (1/N) Σᵢ (d_i - d̂_i)² )
Interpretation:
- Lower is better
- Units match depth units (meters for metric depth)
- Sensitive to outliers (large errors heavily penalized)
- Commonly used alongside RMSE log
Log RMSE
RMSE in logarithmic space:

RMSE_log = sqrt( (1/N) Σᵢ (log d_i - log d̂_i)² )
Benefits:
- Less sensitive to absolute scale
- Treats relative errors more uniformly across depth ranges
- Better for relative depth evaluation
- More robust to outliers
Threshold Accuracy (δ < 1.25)
Percentage of pixels with relative error below a threshold:

δ = max(d_i / d̂_i, d̂_i / d_i) < thr

Common thresholds:
- thr = 1.25 (δ₁)
- thr = 1.25² (δ₂)
- thr = 1.25³ (δ₃)
Interpretation:
- Higher is better (1.0 = 100% of pixels accurate)
- A δ₁ close to 1.0 indicates very good performance
- Scale-invariant metric
- Intuitive: "percentage of pixels with small enough error"
Squared Relative Error (Sq Rel)
Squared relative differences:

SqRel = (1/N) Σᵢ (d_i - d̂_i)² / d_i
Interpretation:
- Lower is better
- More heavily penalizes outliers than Abs Rel
- Less commonly reported than other metrics
Scale-Invariant Metrics
For relative depth evaluation where absolute scale is irrelevant:
Scale-invariant log error:

SILog = (1/N) Σᵢ (Δ_i - Δ̄)²,  with Δ_i = log d̂_i - log d_i

where Δ̄ is the mean log difference (aligning scales).
Use case: Evaluating models that predict relative depth without absolute scale.
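All of the metrics above fit in a few lines of NumPy; a sketch assuming gt and pred are aligned arrays of valid, positive depth values:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics; gt/pred are positive depth arrays."""
    gt, pred = gt.ravel().astype(np.float64), pred.ravel().astype(np.float64)
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    # Threshold accuracy: fraction of pixels with max(gt/pred, pred/gt) < 1.25^k
    ratio = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]

    # Scale-invariant log error: variance of the log difference
    diff = np.log(pred) - np.log(gt)
    silog = np.mean((diff - diff.mean()) ** 2)

    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, d1=d1, d2=d2, d3=d3, silog=silog)
```

Note the scale-invariance in action: multiplying every prediction by a constant leaves SILog unchanged, while Abs Rel and RMSE grow — which is why SILog is preferred for relative-depth models.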
Output Interpretation
Depth Map Visualization
Converting depth values to interpretable images:
Grayscale representation:
- Normalize depth values to [0, 255]
- Black = far, White = near (or inverted)
- Simple but limited perceptual range
Color mapping:
- Apply colormaps (Viridis, Plasma, Turbo, Jet)
- Better perceptual discrimination of depth levels
- More visually appealing
- Standard in publications and demos
Example (Python):
import matplotlib.pyplot as plt

# depth_normalized: depth map scaled to [0, 1]
depth_colored = plt.cm.viridis(depth_normalized)

Normalization Strategies
Depth maps often require normalization for visualization or downstream tasks:
Min-max normalization:
- Maps to [0, 1] range
- Preserves relative ordering
- Sensitive to outliers
Percentile clipping:
- Clip to 1st and 99th percentiles
- Then apply min-max normalization
- More robust to outliers and noise
Inverse depth normalization:
- Work with 1/d instead of d
- Better numerical properties for distant objects
- Common in learning-based methods
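The first two strategies above, sketched in NumPy (the small epsilon guarding against a constant depth map is an implementation detail, not part of the definition):

```python
import numpy as np

def minmax_normalize(depth):
    """Map depth to [0, 1]; preserves ordering but is sensitive to outliers."""
    d = depth.astype(np.float64)
    return (d - d.min()) / (d.max() - d.min() + 1e-8)

def robust_normalize(depth, lo=1, hi=99):
    """Clip to the [lo, hi] percentiles first, then min-max normalize."""
    d = depth.astype(np.float64)
    low, high = np.percentile(d, [lo, hi])
    return minmax_normalize(np.clip(d, low, high))
```

A single spurious pixel (e.g., from a reflective surface) can compress the entire min-max range, which is why the percentile-clipped version is the safer default for visualization.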
Converting to 3D Point Clouds
Depth maps can be unprojected to 3D points:
Camera intrinsics required:
- Focal length (f_x, f_y)
- Principal point (c_x, c_y)

Unprojection formula for pixel (u, v) with depth Z:

X = (u - c_x) * Z / f_x
Y = (v - c_y) * Z / f_y
Result: 3D point cloud representing the scene geometry.
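The unprojection can be vectorized over the whole depth map; a short NumPy sketch assuming a simple pinhole model with no lens distortion:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud.

    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]          # pixel coordinate grids
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
```

The pixel at the principal point maps onto the optical axis (X = Y = 0), and points spread out laterally in proportion to their depth — the geometric reason distant regions of a depth map produce sparse point clouds.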
Applications:
- 3D reconstruction
- Mesh generation
- Scene understanding
- AR/VR rendering
Confidence and Uncertainty
Some methods provide uncertainty estimates alongside depth:
Types:
- Aleatoric uncertainty: Inherent noise in data
- Epistemic uncertainty: Model uncertainty (lack of knowledge)
Use cases:
- Filter unreliable predictions
- Adaptive processing based on confidence
- Active learning for data collection
Common Challenges
Scale Ambiguity
Problem: Monocular depth cannot determine absolute scale.
Manifestation:
- Same model produces different scales for different scenes
- Toy objects vs. real objects confusion
- Inconsistent metric values
Solutions:
- Accept relative depth for applicable use cases
- Fine-tune on domain-specific data with consistent scale
- Use sensor fusion (camera + LiDAR) for ground truth
- Incorporate known object sizes as cues
- Multi-view geometry for scale recovery
Reflective and Transparent Surfaces
Problem: Mirrors, glass, water violate appearance-depth consistency.
Why it happens:
- Reflected/refracted content doesn't match actual surface depth
- Models trained on opaque surfaces struggle
- Specular reflections mislead appearance-based methods
Impact:
- Windows often assigned incorrect depth
- Mirrors show depth of reflected content, not surface
- Water bodies may have inconsistent depth
Solutions:
- Training data with challenging reflective surfaces
- Multi-modal inputs (polarization, thermal)
- Explicit modeling of reflectance properties
- Post-processing to detect and handle glass/mirrors
Textureless Regions
Problem: Large uniform areas (walls, sky, roads) lack features.
Why it happens:
- Deep learning relies on visual patterns
- Flat color regions provide little information
- Self-supervised methods get weak photometric signal
Impact:
- Smooth regions may have noisy or incorrect depth
- Over-smoothing or artifacts
- Uncertain boundaries
Solutions:
- Smoothness priors and regularization
- Multi-scale feature extraction
- Transformer attention for global context
- Edge-aware refinement
- Surface normal constraints
Edge Artifacts
Problem: Blurry or inaccurate depth boundaries between objects.
Causes:
- Upsampling in decoder loses fine detail
- Conflicting depth values at boundaries
- Limited resolution in latent representations
Impact:
- Fuzzy object boundaries
- Halo effects
- Depth bleeding across edges
Solutions:
- Higher resolution processing
- Edge-preserving losses
- Guided filtering with image edges
- Attention mechanisms for sharp boundaries
- Instance-aware depth prediction
Indoor vs. Outdoor Scene Differences
Problem: Performance varies significantly between environments.
Differences:
- Indoor: Complex layouts, small spaces, more occlusion, artificial lighting
- Outdoor: Larger scales, different depth ranges, natural lighting, weather
Impact:
- Models trained on one domain struggle on the other
- Depth range assumptions may not transfer
- Different typical object distributions
Solutions:
- Domain-specific training or fine-tuning
- Mixed dataset training (like MiDaS)
- Domain adaptation techniques
- Separate models for different environments
- Adaptive normalization based on scene type
Computational Cost
Trade-off: Accuracy vs. inference speed.
Factors:
- Model architecture (CNN vs. Transformer)
- Input resolution
- Model size (parameters)
Speed requirements:
- Real-time robotics: 30+ FPS
- Offline 3D reconstruction: Slower acceptable
- Mobile AR: Must run on device with limited power
Optimization:
- Smaller models (MiDaS-small, Depth Anything-S)
- Lower input resolution with upsampling
- Model quantization and pruning
- Hardware acceleration (TensorRT, ONNX)
- Efficient architectures (MobileNet-based encoders)
Practical Applications
3D Reconstruction
Creating 3D models from 2D images:
Process:
- Depth estimation for each view
- Point cloud generation
- Mesh reconstruction (Poisson, TSDF fusion)
- Texture mapping
Applications:
- Building and environment scanning
- Cultural heritage preservation
- E-commerce product models
- Virtual reality environments
Autonomous Navigation
Depth sensing for robots and vehicles:
Use cases:
- Obstacle detection and avoidance
- Path planning in 3D space
- Terrain assessment
- Safe distance estimation
Advantages of monocular:
- Works with single camera (cost-effective)
- Complements LiDAR and radar sensors
- Wide field of view
AR/VR Applications
Depth information for immersive experiences:
Applications:
- Occlusion handling (virtual objects behind real ones)
- Physics simulation (objects interact with environment)
- Hand tracking and gesture recognition
- Scene understanding and semantic mapping
Requirements:
- Real-time performance (30+ FPS)
- Accurate depth at interactive ranges
- Temporal consistency across frames
Robotics
Depth perception for robot manipulation and interaction:
Use cases:
- Grasp planning and manipulation
- Navigation in cluttered environments
- Human-robot interaction (safe distances)
- Object localization and tracking
Challenges:
- Need metric depth for precise control
- Real-time requirements
- Varied lighting and environments
Photo Effects
Depth-based image editing:
Effects:
- Bokeh/Portrait mode: Blur background based on depth
- 3D Photos: Parallax effect from depth (Facebook 3D Photos)
- Relighting: Depth-aware lighting adjustments
- Depth-based filters: Artistic effects using depth
Approach:
- Depth estimation from single photo
- Relative depth sufficient (no metric accuracy needed)
- Post-processing for smoothness and quality
Popular implementations:
- Smartphone portrait modes
- Instagram/Snapchat filters
- Photo editing software
Accessibility Tools
Depth information for visually impaired users:
Applications:
- Audio feedback about obstacles and distances
- Haptic feedback for navigation
- Describing spatial relationships in scenes
- Safe mobility assistance
Requirements:
- Real-time depth on mobile devices
- Accurate obstacle detection
- Reliable in varied environments
Safety and Surveillance
Depth-enhanced monitoring:
Use cases:
- Perimeter intrusion detection (depth-based zones)
- Fall detection (person height from depth)
- Crowd density estimation
- Anomaly detection in 3D space
Choosing an Approach
Consider these factors when selecting a depth estimation method:
For general-purpose zero-shot depth:
- Depth Anything: Latest, strongest generalization
- MiDaS v3.1: Excellent balance, widely used
- Predict relative depth, work across domains
- Good for photo effects, visualization, initial prototyping
For metric depth estimation:
- Fine-tune on domain-specific data with ground truth
- Use stereo or LiDAR during training
- Essential for robotics and autonomous systems
- Consider domain: indoor (NYU Depth v2) vs. outdoor (KITTI)
For real-time applications:
- Smaller models (MiDaS-small, Depth Anything-S)
- Lower input resolution (e.g., 256×256 or 384×384)
- Efficient backbones (MobileNet, EfficientNet)
- Optimize with TensorRT or ONNX
- Profile on target hardware
For highest accuracy:
- Large transformer models (Depth Anything-L, DPT-Large)
- High input resolution (512×512 or higher)
- Ensemble multiple models
- Multi-view or stereo methods if possible
- Accept slower inference
For indoor scenes:
- Models trained on indoor datasets (NYU Depth v2)
- Smaller depth ranges, complex layouts
- Fine-tune on similar environments
For outdoor/driving scenes:
- Models trained on KITTI or similar
- Larger depth ranges
- Handle varying lighting and weather
For mobile deployment:
- Lightweight architectures
- Quantization (INT8 or FP16)
- On-device frameworks (TensorFlow Lite, CoreML)
- Balance accuracy and latency
Next Steps
Ready to train or fine-tune depth estimation models? Our Depth Estimation Training Guide provides comprehensive documentation on:
- Available architectures (MiDaS, DPT, Depth Anything)
- Training strategies for metric vs. relative depth
- Dataset preparation and augmentation
- Fine-tuning on custom domains
- Self-supervised training techniques
- Inference optimization and deployment
For understanding related computer vision tasks, see:
- Image Segmentation - Pixel-level semantic understanding
- Object Detection - Localizing objects in images
- Text-to-Image Generation - Using depth for enhanced generation
- Computer Vision Overview - All vision tasks