Depth Estimation

Predict depth maps from single RGB images for 3D scene understanding

Depth estimation is the task of predicting the distance of every pixel from the camera in a 2D image, creating a depth map that represents the 3D structure of a scene. Monocular depth estimation models predict these depth maps from single RGB images without requiring stereo cameras or LiDAR sensors, enabling 3D scene understanding from standard photographs.

Learn About Depth Estimation

New to depth estimation? Visit our Depth Estimation Concepts Guide to learn about depth maps, monocular depth prediction, and applications in 3D reconstruction and scene understanding.

Available Models

Foundation Models

Foundation models trained on massive diverse datasets with strong zero-shot generalization across domains.

  • Depth Anything - State-of-the-art foundation model for monocular depth with exceptional zero-shot performance

Common Configuration

Input Requirements

Depth estimation models process single RGB images:

Input: Standard RGB image (any resolution)
Output: Depth map (same resolution as input)

Key Inference Parameters

Model Size: Scale of the model architecture

  • Small: Fastest inference, lower accuracy, good for real-time
  • Base: Balanced speed and accuracy
  • Large: Highest accuracy, slower inference
  • Choose based on accuracy needs vs speed constraints

Output Type: Format of depth prediction

  • Depth: Absolute depth values (distance from camera)
  • Disparity: Inverse depth (1/depth), common in stereo vision
  • Most applications use depth output

Understanding Depth Maps

Depth Map Representation:

  • Each pixel contains distance value
  • Typically normalized to 0-1 range or metric scale
  • Visualized as grayscale (darker = closer, lighter = farther)
  • Can be converted to point clouds for 3D reconstruction

Depth vs Disparity:

  • Depth: Actual distance from camera (meters/units)
  • Disparity: Inverse of depth; in stereo vision it is the pixel offset between the two views, proportional to 1/depth
  • Depth is more intuitive; disparity is better conditioned for nearby scene content and is what stereo matching produces natively
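The depth/disparity relationship above can be sketched in a few lines of NumPy (the helper names here are illustrative, not part of any specific library):

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Convert a depth map (distance from camera) to disparity (1/depth)."""
    return 1.0 / np.maximum(depth, eps)

def disparity_to_depth(disparity, eps=1e-6):
    """Invert disparity back to depth; eps guards against division by zero."""
    return 1.0 / np.maximum(disparity, eps)

depth = np.array([[1.0, 2.0], [4.0, 10.0]])  # meters
disp = depth_to_disparity(depth)             # [[1.0, 0.5], [0.25, 0.1]]
```

Note the inversion compresses far distances: everything beyond a few meters collapses into small disparity values, which is why disparity is the more stable target for nearby content.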

Fine-tuning vs Inference

Inference Only (Recommended)

  • Use pre-trained model on new images
  • Zero-shot generalization works remarkably well
  • No training data required
  • Immediate deployment

Fine-tuning

  • Customize for specific camera/domain
  • Requires depth ground truth (LiDAR, stereo, etc.)
  • Improves accuracy on target domain
  • Time-intensive but valuable for specialized applications

Understanding Metrics

Absolute Relative Error (Abs Rel)

  • Measures average relative depth error
  • Lower is better (0.0 = perfect)
  • Typical values: 0.05-0.15 for good models

RMSE (Root Mean Square Error)

  • Measures overall depth prediction error
  • Lower is better
  • Sensitive to large errors

Threshold Accuracy (δ < 1.25)

  • Percentage of pixels where the larger of pred/true and true/pred is below 1.25
  • Higher is better (1.0 = perfect)
  • Most interpretable metric

Log10 Error

  • Logarithmic scale error measurement
  • Less sensitive to errors at large distances
  • Better for large depth ranges
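The four metrics above can be computed directly from a predicted and a ground-truth depth map; the following is a minimal sketch (function name is ours, not from a benchmark library), evaluating only valid pixels where ground truth is positive:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)             # lower is better
    rmse = np.sqrt(np.mean((pred - gt) ** 2))             # lower is better
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                        # higher is better
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1, "log10": log10}
```

A perfect prediction scores 0.0 on Abs Rel, RMSE, and Log10, and 1.0 on δ < 1.25.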

Choosing the Right Model

By Priority

Maximum Accuracy

  1. Depth Anything Large (state-of-the-art zero-shot)
  2. Fine-tuned Depth Anything on domain data
  3. Domain-specific trained models

Fastest Inference

  1. Depth Anything Small (real-time capable)
  2. Reduced input resolution
  3. Optimized inference frameworks

Best Generalization

  1. Depth Anything (trained on 1.5M labeled images plus tens of millions of unlabeled images)
  2. Foundation models over specialized models
  3. Models with large-scale pre-training

By Use Case

3D Reconstruction

  • Depth Anything Large for highest accuracy
  • Output depth maps at full resolution
  • Use metric depth if available
  • Post-process for noise reduction

Augmented Reality

  • Depth Anything Small/Base for speed
  • Real-time inference requirements
  • Consistent depth across frames
  • Edge deployment considerations

Autonomous Navigation

  • Depth Anything Base/Large for reliability
  • Accurate depth in diverse conditions
  • Robust to lighting and weather
  • Fine-tune on domain data for best results

Scene Understanding

  • Depth Anything Base sufficient
  • Relative depth often enough
  • Focus on consistent spatial relationships
  • Works well with semantic segmentation

Photography Effects

  • Depth Anything Base for bokeh/refocus
  • Depth-based segmentation
  • Portrait mode simulation
  • Depth-aware filtering

Best Practices

Input Preparation

  1. Image quality: Higher resolution = better depth detail
  2. Lighting: Models handle various lighting but avoid extreme darkness
  3. Camera calibration: Not required but helps metric depth accuracy
  4. Image format: Standard RGB, any aspect ratio works

Inference Strategy

  1. Start with Large model: Establish accuracy baseline
  2. Optimize if needed: Switch to Base/Small if speed critical
  3. Consistent preprocessing: Normalize consistently across images
  4. Batch processing: Process multiple images together for efficiency

Depth Map Post-processing

  1. Smoothing: Apply bilateral filtering to reduce noise
  2. Edge preservation: Maintain sharp depth discontinuities
  3. Outlier removal: Filter extreme depth values
  4. Metric conversion: Scale to real-world units if calibrated
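Outlier removal and normalization (steps 3 and 4 above, minus metric calibration) can be sketched with percentile clipping; the function name and default percentiles are illustrative choices, not fixed conventions:

```python
import numpy as np

def postprocess_depth(depth, lo_pct=1.0, hi_pct=99.0):
    """Clip extreme depth values to percentiles, then normalize to [0, 1]."""
    depth = np.asarray(depth, float)
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    clipped = np.clip(depth, lo, hi)       # remove outlier spikes
    return (clipped - lo) / max(hi - lo, 1e-6)
```

For edge-preserving smoothing (steps 1 and 2), a bilateral filter such as OpenCV's `cv2.bilateralFilter` is the usual choice, since a plain Gaussian blur would soften exactly the depth discontinuities you want to keep.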

Fine-tuning Guidelines

  1. Ground truth data: Requires accurate depth measurements (LiDAR, stereo)
  2. Domain consistency: Train on data similar to deployment
  3. Data quantity: 500-5000 paired images for good results
  4. Evaluation: Test on held-out scenes, not just held-out images

Hardware Considerations

  • GPU recommended: Especially for Large model or high-resolution
  • CPU capable: Small/Base models work on CPU for non-real-time
  • Memory: Scales with image resolution
  • Real-time: Small model on GPU can achieve 30+ FPS

Common Pitfalls

Scale Ambiguity

Problem: Depth map scale inconsistent across images

Solution: Fine-tune on calibrated data, use same camera/settings, apply scale normalization
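When sparse ground truth is available per image, a common form of scale normalization is per-image median alignment; this sketch (our naming) rescales a relative prediction so its median matches the ground truth's:

```python
import numpy as np

def align_scale(pred, gt):
    """Align a relative depth prediction to ground truth via median scaling."""
    mask = gt > 0                                  # valid ground-truth pixels
    scale = np.median(gt[mask]) / np.median(pred[mask])
    return pred * scale
```

This resolves the global scale per image but not scale drift across a sequence, which is why consistent camera settings still matter.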

Edge Artifacts

Problem: Depth discontinuities appear blurred or incorrect

Solution: Use higher resolution, post-process with edge-aware filtering, fine-tune on sharp imagery

Reflective Surfaces

Problem: Mirrors, glass, water show incorrect depth

Solution: This is a physical limitation of monocular depth; mitigate it with semantic masks that identify reflective regions and flag their depth as unreliable

Textureless Regions

Problem: Plain walls or uniform areas have noisy depth

Solution: Apply smoothing in low-texture regions, leverage geometric priors

Indoor vs Outdoor

Problem: Model performs differently in different environments

Solution: Depth Anything handles both well, but fine-tune for specific domain if needed

Metric Depth Accuracy

Problem: Predicted depths don't match real-world measurements

Solution: Requires camera calibration, fine-tune with metric ground truth, or use relative depth only

Advanced Techniques

Point Cloud Generation

  1. Predict depth map
  2. Back-project using camera intrinsics
  3. Generate 3D point cloud
  4. Apply for 3D reconstruction or scene analysis
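Steps 1-3 above reduce to the pinhole back-projection equations x = (u - cx)·z/fx and y = (v - cy)·z/fy; a minimal sketch, assuming known intrinsics (fx, fy, cx, cy):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to an (N, 3) point cloud via pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

The resulting array can be handed to a point-cloud library (e.g. Open3D) for meshing or scene analysis.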

Multi-frame Depth

  1. Process video frames individually
  2. Temporal consistency filtering
  3. Structure from motion refinement
  4. Improved accuracy from multiple views
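The simplest form of temporal consistency filtering (step 2 above) is an exponential moving average over per-frame depth maps; a sketch with an illustrative function name, assuming static or slowly moving scenes:

```python
import numpy as np

def smooth_depth_stream(frames, alpha=0.8):
    """Exponential moving average over depth maps to reduce frame-to-frame flicker."""
    smoothed, state = [], None
    for d in frames:
        # alpha controls inertia: higher alpha = smoother but laggier
        state = d if state is None else alpha * state + (1 - alpha) * d
        smoothed.append(state)
    return smoothed
```

For scenes with fast motion, a plain EMA lags behind true depth changes, which is where structure-from-motion refinement (step 3) becomes worthwhile.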

Depth-based Segmentation

  1. Use depth discontinuities for boundaries
  2. Cluster by depth similarity
  3. Combine with semantic segmentation
  4. Improved object separation
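Clustering by depth similarity (step 2 above) can be as simple as quantile-based depth bands; this is a deliberately crude depth-only sketch, with names of our choosing, meant to be combined with semantic labels in practice:

```python
import numpy as np

def segment_by_depth(depth, n_bands=4):
    """Label each pixel with a depth band chosen by quantile thresholds."""
    edges = np.quantile(depth, np.linspace(0, 1, n_bands + 1)[1:-1])
    return np.digitize(depth, edges)  # integer label 0..n_bands-1 per pixel
```

Connected-component analysis on the resulting label map then separates objects that share a band but are spatially disjoint.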

Scene Completion

  1. Predict depth for visible regions
  2. Inpaint occluded areas
  3. Generate complete 3D scene
  4. Useful for novel view synthesis

Depth-guided Effects

Bokeh Simulation

  1. Predict depth map
  2. Define focal plane
  3. Apply blur proportional to depth difference
  4. Simulate shallow depth of field
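Steps 2-3 above amount to blending each pixel between the sharp image and a blurred copy, weighted by its distance from the focal plane. A minimal sketch (a real implementation would use a proper disc-shaped blur kernel rather than this toy 3x3 mean):

```python
import numpy as np

def box_blur3(img):
    """Toy 3x3 mean blur via wrap-around shifts (edge handling kept simple)."""
    acc = sum(np.roll(np.roll(img, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return acc / 9.0

def fake_bokeh(img, depth, focal_depth, strength=1.0):
    """Blend sharp and blurred pixels by distance from the focal plane."""
    w = np.clip(strength * np.abs(depth - focal_depth), 0.0, 1.0)[..., None]
    return (1 - w) * img + w * box_blur3(img)
```

Pixels on the focal plane (w = 0) stay sharp; pixels far from it receive the fully blurred value.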

3D Photo Effects

  1. Generate depth map
  2. Create layered representation
  3. Apply parallax motion
  4. Generate animated 3D effect

Depth-aware Compositing

  1. Predict depth for all elements
  2. Composite based on depth ordering
  3. Apply realistic occlusions
  4. Depth-consistent integration
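Depth ordering and occlusion (steps 2-3 above) are a z-buffer test: at each pixel, keep the color from the nearest layer. A sketch with illustrative naming, assuming all layers share the same resolution:

```python
import numpy as np

def composite_by_depth(layers):
    """Z-buffer composite: per pixel, take the color from the nearest layer.

    `layers` is a list of (image, depth) pairs with matching shapes.
    """
    imgs = np.stack([img for img, _ in layers])    # (L, H, W, C)
    depths = np.stack([d for _, d in layers])      # (L, H, W)
    nearest = np.argmin(depths, axis=0)            # winning layer per pixel
    h, w = nearest.shape
    return imgs[nearest, np.arange(h)[:, None], np.arange(w)[None, :]]
```

Soft, depth-consistent blending at layer boundaries (rather than this hard selection) is what separates production compositing from the basic z-test.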

Application Patterns

Robotics and Navigation

  • Real-time obstacle detection
  • Path planning from depth
  • Grasp pose estimation
  • Scene understanding for manipulation

AR/VR

  • Occlusion handling
  • Virtual object placement
  • Realistic interactions
  • Depth-based rendering

Content Creation

  • Portrait mode effects
  • 3D from 2D conversion
  • Depth-based compositing
  • Cinematic effects

Accessibility

  • Scene description for visually impaired
  • Obstacle warning systems
  • Distance estimation
  • Spatial audio rendering

Autonomous Vehicles

  • Supplementary depth sensing
  • Redundancy for sensor fusion
  • Low-cost depth alternative
  • Weather-robust perception
