Depth Estimation

Predict depth maps from single RGB images for 3D scene understanding

Depth estimation is the task of predicting the distance of every pixel from the camera in a 2D image, creating a depth map that represents the 3D structure of a scene. Monocular depth estimation models predict these depth maps from single RGB images without requiring stereo cameras or LiDAR sensors, enabling 3D scene understanding from standard photographs.

Learn About Depth Estimation

New to depth estimation? Visit our Depth Estimation Concepts Guide to learn about depth maps, monocular depth prediction, and applications in 3D reconstruction and scene understanding.

Available Models

Foundation Models

Foundation models trained on massive diverse datasets with strong zero-shot generalization across domains.

  • Depth Anything - State-of-the-art foundation model for monocular depth with exceptional zero-shot performance

Common Configuration

Input Requirements

Depth estimation models process single RGB images:

Input: Standard RGB image (any resolution)
Output: Depth map (same resolution as input)

Key Inference Parameters

Model Size: Scale of the model architecture

  • Small: Fastest inference, lower accuracy, good for real-time
  • Base: Balanced speed and accuracy
  • Large: Highest accuracy, slower inference
  • Choose based on accuracy needs vs speed constraints

Output Type: Format of depth prediction

  • Depth: Absolute depth values (distance from camera)
  • Disparity: Inverse depth (1/depth), common in stereo vision
  • Most applications use depth output

Understanding Depth Maps

Depth Map Representation:

  • Each pixel contains distance value
  • Typically normalized to 0-1 range or metric scale
  • Visualized as grayscale (darker = closer, lighter = farther)
  • Can be converted to point clouds for 3D reconstruction

Depth vs Disparity:

  • Depth: Actual distance from camera (meters/units)
  • Disparity: Inverse of depth; in stereo vision it is the pixel offset between the two views, proportional to 1/depth
  • Depth is more intuitive; disparity is better conditioned for nearby scene content and is what stereo matching produces natively
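The depth/disparity relationship above can be sketched in a few lines of NumPy (the helper names here are illustrative, not part of any specific library):

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Convert a depth map (distance from camera) to disparity (1/depth)."""
    return 1.0 / np.maximum(depth, eps)

def disparity_to_depth(disparity, eps=1e-6):
    """Invert disparity back to depth; eps guards against division by zero."""
    return 1.0 / np.maximum(disparity, eps)

depth = np.array([[1.0, 2.0], [4.0, 10.0]])  # meters
disp = depth_to_disparity(depth)             # [[1.0, 0.5], [0.25, 0.1]]
```

Note the inversion compresses far distances: everything beyond a few meters collapses into small disparity values, which is why disparity is the more stable target for nearby content.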

Fine-tuning vs Inference

Inference Only (Recommended)

  • Use pre-trained model on new images
  • Zero-shot generalization works remarkably well
  • No training data required
  • Immediate deployment

Fine-tuning

  • Customize for specific camera/domain
  • Requires depth ground truth (LiDAR, stereo, etc.)
  • Improves accuracy on target domain
  • Time-intensive but valuable for specialized applications

Understanding Metrics

Absolute Relative Error (Abs Rel)

  • Measures average relative depth error
  • Lower is better (0.0 = perfect)
  • Typical values: 0.05-0.15 for good models

RMSE (Root Mean Square Error)

  • Measures overall depth prediction error
  • Lower is better
  • Sensitive to large errors

Threshold Accuracy (δ < 1.25)

  • Percentage of pixels where the larger of pred/true and true/pred is below 1.25
  • Higher is better (1.0 = perfect)
  • Most interpretable metric

Log10 Error

  • Logarithmic scale error measurement
  • Less sensitive to errors at large distances
  • Better for large depth ranges
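The four metrics above can be computed directly from a predicted and a ground-truth depth map; the following is a minimal sketch (function name is ours, not from a benchmark library), evaluating only valid pixels where ground truth is positive:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)             # lower is better
    rmse = np.sqrt(np.mean((pred - gt) ** 2))             # lower is better
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                        # higher is better
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1, "log10": log10}
```

A perfect prediction scores 0.0 on Abs Rel, RMSE, and Log10, and 1.0 on δ < 1.25.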

Choosing the Right Model

By Priority

Maximum Accuracy

  1. Depth Anything Large (state-of-the-art zero-shot)
  2. Fine-tuned Depth Anything on domain data
  3. Domain-specific trained models

Fastest Inference

  1. Depth Anything Small (real-time capable)
  2. Reduced input resolution
  3. Optimized inference frameworks

Best Generalization

  1. Depth Anything (trained on 1.5M labeled images plus tens of millions of unlabeled images)
  2. Foundation models over specialized models
  3. Models with large-scale pre-training

By Use Case

3D Reconstruction

  • Depth Anything Large for highest accuracy
  • Output depth maps at full resolution
  • Use metric depth if available
  • Post-process for noise reduction

Augmented Reality

  • Depth Anything Small/Base for speed
  • Real-time inference requirements
  • Consistent depth across frames
  • Edge deployment considerations

Autonomous Navigation

  • Depth Anything Base/Large for reliability
  • Accurate depth in diverse conditions
  • Robust to lighting and weather
  • Fine-tune on domain data for best results

Scene Understanding

  • Depth Anything Base sufficient
  • Relative depth often enough
  • Focus on consistent spatial relationships
  • Works well with semantic segmentation

Photography Effects

  • Depth Anything Base for bokeh/refocus
  • Depth-based segmentation
  • Portrait mode simulation
  • Depth-aware filtering

Best Practices

Input Preparation

  1. Image quality: Higher resolution = better depth detail
  2. Lighting: Models handle various lighting but avoid extreme darkness
  3. Camera calibration: Not required but helps metric depth accuracy
  4. Image format: Standard RGB, any aspect ratio works

Inference Strategy

  1. Start with Large model: Establish accuracy baseline
  2. Optimize if needed: Switch to Base/Small if speed critical
  3. Consistent preprocessing: Normalize consistently across images
  4. Batch processing: Process multiple images together for efficiency

Depth Map Post-processing

  1. Smoothing: Apply bilateral filtering to reduce noise
  2. Edge preservation: Maintain sharp depth discontinuities
  3. Outlier removal: Filter extreme depth values
  4. Metric conversion: Scale to real-world units if calibrated
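Outlier removal and normalization (steps 3 and 4 above, minus metric calibration) can be sketched with percentile clipping; the function name and default percentiles are illustrative choices, not fixed conventions:

```python
import numpy as np

def postprocess_depth(depth, lo_pct=1.0, hi_pct=99.0):
    """Clip extreme depth values to percentiles, then normalize to [0, 1]."""
    depth = np.asarray(depth, float)
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    clipped = np.clip(depth, lo, hi)       # remove outlier spikes
    return (clipped - lo) / max(hi - lo, 1e-6)
```

For edge-preserving smoothing (steps 1 and 2), a bilateral filter such as OpenCV's `cv2.bilateralFilter` is the usual choice, since a plain Gaussian blur would soften exactly the depth discontinuities you want to keep.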

Fine-tuning Guidelines

  1. Ground truth data: Requires accurate depth measurements (LiDAR, stereo)
  2. Domain consistency: Train on data similar to deployment
  3. Data quantity: 500-5000 paired images for good results
  4. Evaluation: Test on held-out scenes, not just held-out images

Hardware Considerations

  • GPU recommended: Especially for Large model or high-resolution
  • CPU capable: Small/Base models work on CPU for non-real-time
  • Memory: Scales with image resolution
  • Real-time: Small model on GPU can achieve 30+ FPS

Common Pitfalls

Scale Ambiguity

Problem: Depth map scale inconsistent across images

Solution: Fine-tune on calibrated data, use same camera/settings, apply scale normalization
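When sparse ground truth is available per image, a common form of scale normalization is per-image median alignment; this sketch (our naming) rescales a relative prediction so its median matches the ground truth's:

```python
import numpy as np

def align_scale(pred, gt):
    """Align a relative depth prediction to ground truth via median scaling."""
    mask = gt > 0                                  # valid ground-truth pixels
    scale = np.median(gt[mask]) / np.median(pred[mask])
    return pred * scale
```

This resolves the global scale per image but not scale drift across a sequence, which is why consistent camera settings still matter.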

Edge Artifacts

Problem: Depth discontinuities appear blurred or incorrect

Solution: Use higher resolution, post-process with edge-aware filtering, fine-tune on sharp imagery

Reflective Surfaces

Problem: Mirrors, glass, water show incorrect depth

Solution: This is a physical limitation of monocular depth; mitigate it with semantic masks that identify reflective regions and flag their depth as unreliable

Textureless Regions

Problem: Plain walls or uniform areas have noisy depth

Solution: Apply smoothing in low-texture regions, leverage geometric priors

Indoor vs Outdoor

Problem: Model performs differently in different environments

Solution: Depth Anything handles both well, but fine-tune for specific domain if needed

Metric Depth Accuracy

Problem: Predicted depths don't match real-world measurements

Solution: Requires camera calibration, fine-tune with metric ground truth, or use relative depth only

Advanced Techniques

Point Cloud Generation

  1. Predict depth map
  2. Back-project using camera intrinsics
  3. Generate 3D point cloud
  4. Apply for 3D reconstruction or scene analysis
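Steps 1-3 above reduce to the pinhole back-projection equations x = (u - cx)·z/fx and y = (v - cy)·z/fy; a minimal sketch, assuming known intrinsics (fx, fy, cx, cy):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to an (N, 3) point cloud via pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

The resulting array can be handed to a point-cloud library (e.g. Open3D) for meshing or scene analysis.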

Multi-frame Depth

  1. Process video frames individually
  2. Temporal consistency filtering
  3. Structure from motion refinement
  4. Improved accuracy from multiple views
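The simplest form of temporal consistency filtering (step 2 above) is an exponential moving average over per-frame depth maps; a sketch with an illustrative function name, assuming static or slowly moving scenes:

```python
import numpy as np

def smooth_depth_stream(frames, alpha=0.8):
    """Exponential moving average over depth maps to reduce frame-to-frame flicker."""
    smoothed, state = [], None
    for d in frames:
        # alpha controls inertia: higher alpha = smoother but laggier
        state = d if state is None else alpha * state + (1 - alpha) * d
        smoothed.append(state)
    return smoothed
```

For scenes with fast motion, a plain EMA lags behind true depth changes, which is where structure-from-motion refinement (step 3) becomes worthwhile.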

Depth-based Segmentation

  1. Use depth discontinuities for boundaries
  2. Cluster by depth similarity
  3. Combine with semantic segmentation
  4. Improved object separation
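Clustering by depth similarity (step 2 above) can be as simple as quantile-based depth bands; this is a deliberately crude depth-only sketch, with names of our choosing, meant to be combined with semantic labels in practice:

```python
import numpy as np

def segment_by_depth(depth, n_bands=4):
    """Label each pixel with a depth band chosen by quantile thresholds."""
    edges = np.quantile(depth, np.linspace(0, 1, n_bands + 1)[1:-1])
    return np.digitize(depth, edges)  # integer label 0..n_bands-1 per pixel
```

Connected-component analysis on the resulting label map then separates objects that share a band but are spatially disjoint.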

Scene Completion

  1. Predict depth for visible regions
  2. Inpaint occluded areas
  3. Generate complete 3D scene
  4. Useful for novel view synthesis

Depth-guided Effects

Bokeh Simulation

  1. Predict depth map
  2. Define focal plane
  3. Apply blur proportional to depth difference
  4. Simulate shallow depth of field
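Steps 2-3 above amount to blending each pixel between the sharp image and a blurred copy, weighted by its distance from the focal plane. A minimal sketch (a real implementation would use a proper disc-shaped blur kernel rather than this toy 3x3 mean):

```python
import numpy as np

def box_blur3(img):
    """Toy 3x3 mean blur via wrap-around shifts (edge handling kept simple)."""
    acc = sum(np.roll(np.roll(img, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return acc / 9.0

def fake_bokeh(img, depth, focal_depth, strength=1.0):
    """Blend sharp and blurred pixels by distance from the focal plane."""
    w = np.clip(strength * np.abs(depth - focal_depth), 0.0, 1.0)[..., None]
    return (1 - w) * img + w * box_blur3(img)
```

Pixels on the focal plane (w = 0) stay sharp; pixels far from it receive the fully blurred value.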

3D Photo Effects

  1. Generate depth map
  2. Create layered representation
  3. Apply parallax motion
  4. Generate animated 3D effect

Depth-aware Compositing

  1. Predict depth for all elements
  2. Composite based on depth ordering
  3. Apply realistic occlusions
  4. Depth-consistent integration
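Depth ordering and occlusion (steps 2-3 above) are a z-buffer test: at each pixel, keep the color from the nearest layer. A sketch with illustrative naming, assuming all layers share the same resolution:

```python
import numpy as np

def composite_by_depth(layers):
    """Z-buffer composite: per pixel, take the color from the nearest layer.

    `layers` is a list of (image, depth) pairs with matching shapes.
    """
    imgs = np.stack([img for img, _ in layers])    # (L, H, W, C)
    depths = np.stack([d for _, d in layers])      # (L, H, W)
    nearest = np.argmin(depths, axis=0)            # winning layer per pixel
    h, w = nearest.shape
    return imgs[nearest, np.arange(h)[:, None], np.arange(w)[None, :]]
```

Soft, depth-consistent blending at layer boundaries (rather than this hard selection) is what separates production compositing from the basic z-test.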

Application Patterns

Robotics and Navigation

  • Real-time obstacle detection
  • Path planning from depth
  • Grasp pose estimation
  • Scene understanding for manipulation

AR/VR

  • Occlusion handling
  • Virtual object placement
  • Realistic interactions
  • Depth-based rendering

Content Creation

  • Portrait mode effects
  • 3D from 2D conversion
  • Depth-based compositing
  • Cinematic effects

Accessibility

  • Scene description for visually impaired
  • Obstacle warning systems
  • Distance estimation
  • Spatial audio rendering

Autonomous Vehicles

  • Supplementary depth sensing
  • Redundancy for sensor fusion
  • Low-cost depth alternative
  • Weather-robust perception
