Depth Estimation
Predicting the distance of objects from the camera to understand 3D scene structure
Depth estimation is the task of predicting the distance from the camera to every pixel or point in an image, creating a depth map that represents the 3D structure of the scene. It's a fundamental computer vision task that enables machines to understand spatial relationships and 3D geometry from 2D images.
📚 Training Depth Estimation Models
Looking to train depth estimation models? Check out our comprehensive Depth Estimation Training Guide with detailed parameter documentation for all available models and training techniques.
What is Depth Estimation?
Depth estimation takes a 2D image as input and produces a depth map — an image where each pixel value represents the distance from the camera to the corresponding point in the scene. Brighter pixels typically indicate objects closer to the camera, while darker pixels represent farther objects (or vice versa, depending on representation).
Examples:
- A photo of a street scene → depth map showing cars nearby and buildings far away
- Indoor room image → depth values for furniture, walls, and floor at different distances
- Portrait photo → depth map distinguishing person from background
- Landscape image → depth gradients from foreground to distant mountains
Applications: Autonomous driving, robotics, AR/VR, 3D reconstruction, photo effects, and accessibility tools.
Key Concepts
Monocular vs. Stereo Depth Estimation
Monocular depth estimation:
- Uses a single image as input
- More challenging: lacks explicit stereo cues
- Relies on learned priors about object sizes, perspective, occlusion
- More practical: works with any camera, including existing photos
- This is the focus of most modern deep learning approaches
Stereo depth estimation:
- Uses two images from different viewpoints (like human eyes)
- Triangulates depth through disparity between views
- More geometrically grounded
- Requires calibrated stereo camera setup
- Classical approaches well-established
Multi-view depth estimation:
- Uses multiple images from different angles
- Structure-from-Motion (SfM) techniques
- More accurate but requires multiple captures
Depth Maps
The output of depth estimation — a 2D map where each pixel encodes depth information:
Representation formats:
- Inverse depth: 1/d (common in learning-based methods)
- Disparity: Related to depth in stereo vision
- Metric depth: Actual distance in meters
- Relative depth: Ordinal relationships without absolute scale
Visualization:
- Typically shown as grayscale images
- Colormaps (Viridis, Plasma, Turbo) for better perception
- Lighter/warmer colors for near, darker/cooler for far (or inverted)
Resolution: Usually matches input image resolution, though some methods predict at different scales.
Relative vs. Absolute Depth
Relative (ordinal) depth:
- Predicts depth ordering: which objects are closer or farther
- No absolute scale (one scene's "10" might be another's "100")
- Sufficient for many applications (photo effects, occlusion reasoning)
- Easier to learn: consistent across different scenes
- Most general-purpose models predict relative depth
Absolute (metric) depth:
- Predicts actual distances in real-world units (meters)
- Requires training data with ground-truth metric measurements
- Scene-specific scale factor
- Essential for robotics, autonomous driving, AR
- Harder to generalize across domains
Scale ambiguity: Monocular depth estimation fundamentally cannot determine absolute scale without additional information (like known object sizes).
Disparity
In stereo vision, disparity is the difference in image location of an object when viewed from different positions:
Relationship to depth:

Z = (f * B) / d

where:
- Z = depth
- f = focal length
- B = baseline (distance between cameras)
- d = disparity
Inverse relationship: Closer objects have larger disparity, farther objects smaller.
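The relationship Z = f * B / d is straightforward to apply per pixel; here is a minimal NumPy sketch (the focal length and baseline values are illustrative, roughly KITTI-like):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to metric depth (meters): Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)  # zero disparity -> infinitely far
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative stereo rig: focal length in pixels, baseline in meters
f, B = 721.5, 0.54
disp = np.array([[64.0, 32.0], [16.0, 0.0]])
print(disparity_to_depth(disp, f, B))
```

Note the inverse relationship in the output: halving the disparity doubles the depth, and zero disparity (no shift between views) corresponds to a point at infinity.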
Scale Ambiguity
Fundamental challenge in monocular depth estimation:
Problem: From a single image, a small nearby object looks identical to a large far-away object.
Example: A toy car close to the camera vs. a real car far away can produce the same image.
Implications:
- Cannot recover absolute metric depth without additional cues
- Models learn statistical priors about typical object sizes
- Scale may vary between scenes
- Fine-tuning on domain-specific data can improve scale consistency
Solutions:
- Multi-view methods (structure-from-motion)
- Known reference objects in scene
- Sensor fusion (camera + LiDAR)
- Domain-specific training (e.g., only indoor or only outdoor)
Approaches and Architectures
CNN-Based Methods
Convolutional Neural Networks for dense depth prediction:
Early approaches:
- Eigen et al. (2014): Multi-scale architecture, coarse-to-fine prediction
- FCRN (Fully Convolutional Residual Networks): Up-convolution for resolution recovery
- Typically used encoder-decoder architectures
MiDaS family (Mixed Data Strategy):
- MiDaS v2: ResNet or ResNeXt encoder, multi-scale decoder
- MiDaS v3: Combines multiple datasets with affine-invariant loss
- MiDaS v3.1: Adds smaller efficient models (DPT-Hybrid)
- Trained on diverse datasets for generalization
- Predicts relative depth (robust across domains)
- State-of-the-art zero-shot performance
Key techniques:
- Skip connections from encoder to decoder
- Multi-scale feature fusion
- Up-sampling strategies (transpose convolutions, bilinear interpolation)
Transformer-Based Methods
Vision Transformers applied to depth estimation:
DPT (Dense Prediction Transformer):
- Vision Transformer (ViT) backbone
- Reassembles tokens at multiple scales
- Convolutional decoder for dense predictions
- Better global context understanding than CNNs
- Included in MiDaS v3 family
Depth Anything (2024):
- Large-scale foundation model approach
- Trained on massive unlabeled data with pseudo-labels
- Strong zero-shot generalization
- Versions: Small, Base, Large
- Excellent fine-grained detail and edge preservation
DepthFormer:
- Transformer encoder with hierarchical features
- Efficient attention mechanisms
- Competitive accuracy with lower compute
Advantages of Transformers:
- Better long-range dependencies
- More effective at capturing global scene context
- Superior performance with sufficient data
- Trade-off: higher computational cost than CNNs
Self-Supervised Learning
Training depth models without ground-truth depth labels:
Core idea: Use stereo pairs or video sequences and enforce geometric consistency.
Monodepth2:
- Predicts depth from monocular video
- Loss based on photometric reprojection error
- Learns from temporal consistency
- No explicit depth supervision needed
Process:
- Predict depth for frame t
- Predict camera pose between frames t and t+1
- Warp frame t+1 to frame t using predicted depth and pose
- Minimize reconstruction error between original and warped frame
Formula:

L = || I_t - Î_t ||

where Î_t is the warped image (frame t+1 warped into frame t using the predicted depth and pose).
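The four steps above can be sketched for a pinhole camera in NumPy. This is a minimal illustration with nearest-neighbor sampling and a plain L1 error; a real implementation such as Monodepth2 uses differentiable bilinear sampling and an SSIM + L1 loss, so treat this as a sketch of the geometry, not the actual training loss:

```python
import numpy as np

def photometric_error(img_target, img_source, depth, K, R, t):
    """Warp img_source into the target view using depth + pose, return mean L1 error.

    img_*: (H, W) grayscale images, depth: (H, W) depth for the target view,
    K: (3, 3) intrinsics, (R, t): source-from-target rotation/translation.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # homogeneous pixel coords

    # 1) Unproject target pixels to 3D using the predicted depth
    pts = np.linalg.inv(K) @ pix * depth.ravel()
    # 2) Transform into the source camera frame with the predicted pose
    pts_src = R @ pts + t[:, None]
    # 3) Project into the source image plane
    proj = K @ pts_src
    u_s = np.round(proj[0] / proj[2]).astype(int)
    v_s = np.round(proj[1] / proj[2]).astype(int)
    valid = (u_s >= 0) & (u_s < W) & (v_s >= 0) & (v_s < H) & (proj[2] > 0)

    # 4) Sample the source image and compare (photometric reprojection error)
    warped = np.zeros(H * W)
    warped[valid] = img_source[v_s[valid], u_s[valid]]
    return np.abs(img_target.ravel()[valid] - warped[valid]).mean()
```

With an identity pose and identical frames the error is zero regardless of the predicted depth, which is exactly why moving objects and textureless regions (discussed below) provide such a weak training signal.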
Benefits:
- No expensive depth annotations needed
- Can leverage abundant video data
- Learns from real-world geometric constraints
Challenges:
- Moving objects violate static scene assumption
- Textureless regions provide little supervision
- Scale ambiguity remains
Multi-Task Learning
Learning depth jointly with related tasks:
Common combinations:
- Depth + Semantic Segmentation: Shared features benefit both tasks
- Depth + Surface Normals: Geometric consistency
- Depth + Optical Flow: Motion understanding
Benefits:
- Improved generalization through shared representations
- Mutual regularization between tasks
- More efficient use of training data
Example architecture:
- Shared encoder
- Task-specific decoder heads
- Multi-task loss: L_total = Σᵢ λᵢ Lᵢ (weighted sum of per-task losses)
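A weighted multi-task loss is just a λ-weighted sum over the per-task terms; a minimal sketch (the task names and weight values are illustrative):

```python
def multi_task_loss(task_losses, weights):
    """L_total = sum_i lambda_i * L_i over the task-specific losses."""
    return sum(weights[name] * loss for name, loss in task_losses.items())

# Illustrative scalar losses and lambda weights for a depth + segmentation model
losses = {"depth": 1.0, "segmentation": 2.0}
weights = {"depth": 1.0, "segmentation": 0.5}
print(multi_task_loss(losses, weights))  # 1.0*1.0 + 2.0*0.5 = 2.0
```

In practice the weights are either tuned by hand or learned (e.g., via task-uncertainty weighting), since a poorly balanced loss lets one task dominate the shared encoder.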
Evaluation Metrics
Absolute Relative Error (Abs Rel)
Measures relative depth error averaged over all pixels:

AbsRel = (1/N) Σᵢ |d_i - d̂_i| / d_i

where d_i is the ground-truth depth and d̂_i is the predicted depth for pixel i.
Interpretation:
- Lower is better (0 = perfect prediction)
- Scale-independent: works for relative and absolute depth
- Emphasizes relative accuracy rather than absolute values
- Commonly reported metric
Root Mean Squared Error (RMSE)
Standard metric for prediction error:

RMSE = sqrt( (1/N) Σᵢ (d_i - d̂_i)² )
Interpretation:
- Lower is better
- Units match depth units (meters for metric depth)
- Sensitive to outliers (large errors heavily penalized)
- Commonly used alongside RMSE log
Log RMSE
RMSE in logarithmic space:

RMSE_log = sqrt( (1/N) Σᵢ (log d_i - log d̂_i)² )
Benefits:
- Less sensitive to absolute scale
- Treats relative errors more uniformly across depth ranges
- Better for relative depth evaluation
- More robust to outliers
Threshold Accuracy (δ < 1.25)
Percentage of pixels with relative error below a threshold:

δ = max(d_i / d̂_i, d̂_i / d_i) < thr

Common thresholds:
- thr = 1.25 (δ₁)
- thr = 1.25² (δ₂)
- thr = 1.25³ (δ₃)
Interpretation:
- Higher is better (1.0 = 100% of pixels accurate)
- A δ₁ close to 1.0 indicates very good performance
- Scale-invariant metric
- Intuitive: "percentage of pixels with small enough error"
Squared Relative Error (Sq Rel)
Squared relative differences:

SqRel = (1/N) Σᵢ (d_i - d̂_i)² / d_i
Interpretation:
- Lower is better
- More heavily penalizes outliers than Abs Rel
- Less commonly reported than other metrics
Scale-Invariant Metrics
For relative depth evaluation where absolute scale is irrelevant:
Scale-invariant log error:

SILog = (1/N) Σᵢ (Δ_i - Δ̄)²,  with Δ_i = log d̂_i - log d_i

where Δ̄ is the mean log difference (aligning scales).
Use case: Evaluating models that predict relative depth without absolute scale.
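All of the metrics above fit in a few lines of NumPy; a sketch assuming gt and pred are aligned arrays of valid, positive depth values:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics; gt/pred are positive depth arrays."""
    gt, pred = gt.ravel().astype(np.float64), pred.ravel().astype(np.float64)
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    # Threshold accuracy: fraction of pixels with max(gt/pred, pred/gt) < 1.25^k
    ratio = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]

    # Scale-invariant log error: variance of the log difference
    diff = np.log(pred) - np.log(gt)
    silog = np.mean((diff - diff.mean()) ** 2)

    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, d1=d1, d2=d2, d3=d3, silog=silog)
```

Note the scale-invariance in action: multiplying every prediction by a constant leaves SILog unchanged, while Abs Rel and RMSE grow — which is why SILog is preferred for relative-depth models.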
Output Interpretation
Depth Map Visualization
Converting depth values to interpretable images:
Grayscale representation:
- Normalize depth values to [0, 255]
- Black = far, White = near (or inverted)
- Simple but limited perceptual range
Color mapping:
- Apply colormaps (Viridis, Plasma, Turbo, Jet)
- Better perceptual discrimination of depth levels
- More visually appealing
- Standard in publications and demos
Example (Python):
import matplotlib.pyplot as plt

# depth_normalized: depth map scaled to [0, 1]
depth_colored = plt.cm.viridis(depth_normalized)

Normalization Strategies
Depth maps often require normalization for visualization or downstream tasks:
Min-max normalization:
- Maps to [0, 1] range
- Preserves relative ordering
- Sensitive to outliers
Percentile clipping:
- Clip to 1st and 99th percentiles
- Then apply min-max normalization
- More robust to outliers and noise
Inverse depth normalization:
- Work with 1/d instead of d
- Better numerical properties for distant objects
- Common in learning-based methods
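The first two strategies above, sketched in NumPy (the small epsilon guarding against a constant depth map is an implementation detail, not part of the definition):

```python
import numpy as np

def minmax_normalize(depth):
    """Map depth to [0, 1]; preserves ordering but is sensitive to outliers."""
    d = depth.astype(np.float64)
    return (d - d.min()) / (d.max() - d.min() + 1e-8)

def robust_normalize(depth, lo=1, hi=99):
    """Clip to the [lo, hi] percentiles first, then min-max normalize."""
    d = depth.astype(np.float64)
    low, high = np.percentile(d, [lo, hi])
    return minmax_normalize(np.clip(d, low, high))
```

A single spurious pixel (e.g., from a reflective surface) can compress the entire min-max range, which is why the percentile-clipped version is the safer default for visualization.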
Converting to 3D Point Clouds
Depth maps can be unprojected to 3D points:
Camera intrinsics required:
- Focal length (f_x, f_y)
- Principal point (c_x, c_y)

Unprojection formula for pixel (u, v) with depth Z:

X = (u - c_x) * Z / f_x
Y = (v - c_y) * Z / f_y
Result: 3D point cloud representing the scene geometry.
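The unprojection can be vectorized over the whole depth map; a short NumPy sketch assuming a simple pinhole model with no lens distortion:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud.

    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]          # pixel coordinate grids
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
```

The pixel at the principal point maps onto the optical axis (X = Y = 0), and points spread out laterally in proportion to their depth — the geometric reason distant regions of a depth map produce sparse point clouds.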
Applications:
- 3D reconstruction
- Mesh generation
- Scene understanding
- AR/VR rendering
Confidence and Uncertainty
Some methods provide uncertainty estimates alongside depth:
Types:
- Aleatoric uncertainty: Inherent noise in data
- Epistemic uncertainty: Model uncertainty (lack of knowledge)
Use cases:
- Filter unreliable predictions
- Adaptive processing based on confidence
- Active learning for data collection
Common Challenges
Scale Ambiguity
Problem: Monocular depth cannot determine absolute scale.
Manifestation:
- Same model produces different scales for different scenes
- Toy objects vs. real objects confusion
- Inconsistent metric values
Solutions:
- Accept relative depth for applicable use cases
- Fine-tune on domain-specific data with consistent scale
- Use sensor fusion (camera + LiDAR) for ground truth
- Incorporate known object sizes as cues
- Multi-view geometry for scale recovery
Reflective and Transparent Surfaces
Problem: Mirrors, glass, water violate appearance-depth consistency.
Why it happens:
- Reflected/refracted content doesn't match actual surface depth
- Models trained on opaque surfaces struggle
- Specular reflections mislead appearance-based methods
Impact:
- Windows often assigned incorrect depth
- Mirrors show depth of reflected content, not surface
- Water bodies may have inconsistent depth
Solutions:
- Training data with challenging reflective surfaces
- Multi-modal inputs (polarization, thermal)
- Explicit modeling of reflectance properties
- Post-processing to detect and handle glass/mirrors
Textureless Regions
Problem: Large uniform areas (walls, sky, roads) lack features.
Why it happens:
- Deep learning relies on visual patterns
- Flat color regions provide little information
- Self-supervised methods get weak photometric signal
Impact:
- Smooth regions may have noisy or incorrect depth
- Over-smoothing or artifacts
- Uncertain boundaries
Solutions:
- Smoothness priors and regularization
- Multi-scale feature extraction
- Transformer attention for global context
- Edge-aware refinement
- Surface normal constraints
Edge Artifacts
Problem: Blurry or inaccurate depth boundaries between objects.
Causes:
- Upsampling in decoder loses fine detail
- Conflicting depth values at boundaries
- Limited resolution in latent representations
Impact:
- Fuzzy object boundaries
- Halo effects
- Depth bleeding across edges
Solutions:
- Higher resolution processing
- Edge-preserving losses
- Guided filtering with image edges
- Attention mechanisms for sharp boundaries
- Instance-aware depth prediction
Indoor vs. Outdoor Scene Differences
Problem: Performance varies significantly between environments.
Differences:
- Indoor: Complex layouts, small spaces, more occlusion, artificial lighting
- Outdoor: Larger scales, different depth ranges, natural lighting, weather
Impact:
- Models trained on one domain struggle on the other
- Depth range assumptions may not transfer
- Different typical object distributions
Solutions:
- Domain-specific training or fine-tuning
- Mixed dataset training (like MiDaS)
- Domain adaptation techniques
- Separate models for different environments
- Adaptive normalization based on scene type
Computational Cost
Trade-off: Accuracy vs. inference speed.
Factors:
- Model architecture (CNN vs. Transformer)
- Input resolution
- Model size (parameters)
Speed requirements:
- Real-time robotics: 30+ FPS
- Offline 3D reconstruction: Slower acceptable
- Mobile AR: Must run on device with limited power
Optimization:
- Smaller models (MiDaS-small, Depth Anything-S)
- Lower input resolution with upsampling
- Model quantization and pruning
- Hardware acceleration (TensorRT, ONNX)
- Efficient architectures (MobileNet-based encoders)
Practical Applications
3D Reconstruction
Creating 3D models from 2D images:
Process:
- Depth estimation for each view
- Point cloud generation
- Mesh reconstruction (Poisson, TSDF fusion)
- Texture mapping
Applications:
- Building and environment scanning
- Cultural heritage preservation
- E-commerce product models
- Virtual reality environments
Autonomous Navigation
Depth sensing for robots and vehicles:
Use cases:
- Obstacle detection and avoidance
- Path planning in 3D space
- Terrain assessment
- Safe distance estimation
Advantages of monocular:
- Works with single camera (cost-effective)
- Complements LiDAR and radar sensors
- Wide field of view
AR/VR Applications
Depth information for immersive experiences:
Applications:
- Occlusion handling (virtual objects behind real ones)
- Physics simulation (objects interact with environment)
- Hand tracking and gesture recognition
- Scene understanding and semantic mapping
Requirements:
- Real-time performance (30+ FPS)
- Accurate depth at interactive ranges
- Temporal consistency across frames
Robotics
Depth perception for robot manipulation and interaction:
Use cases:
- Grasp planning and manipulation
- Navigation in cluttered environments
- Human-robot interaction (safe distances)
- Object localization and tracking
Challenges:
- Need metric depth for precise control
- Real-time requirements
- Varied lighting and environments
Photo Effects
Depth-based image editing:
Effects:
- Bokeh/Portrait mode: Blur background based on depth
- 3D Photos: Parallax effect from depth (Facebook 3D Photos)
- Relighting: Depth-aware lighting adjustments
- Depth-based filters: Artistic effects using depth
Approach:
- Depth estimation from single photo
- Relative depth sufficient (no metric accuracy needed)
- Post-processing for smoothness and quality
Popular implementations:
- Smartphone portrait modes
- Instagram/Snapchat filters
- Photo editing software
Accessibility Tools
Depth information for visually impaired users:
Applications:
- Audio feedback about obstacles and distances
- Haptic feedback for navigation
- Describing spatial relationships in scenes
- Safe mobility assistance
Requirements:
- Real-time depth on mobile devices
- Accurate obstacle detection
- Reliable in varied environments
Safety and Surveillance
Depth-enhanced monitoring:
Use cases:
- Perimeter intrusion detection (depth-based zones)
- Fall detection (person height from depth)
- Crowd density estimation
- Anomaly detection in 3D space
Choosing an Approach
Consider these factors when selecting a depth estimation method:
For general-purpose zero-shot depth:
- Depth Anything: Latest, strongest generalization
- MiDaS v3.1: Excellent balance, widely used
- Predict relative depth, work across domains
- Good for photo effects, visualization, initial prototyping
For metric depth estimation:
- Fine-tune on domain-specific data with ground truth
- Use stereo or LiDAR during training
- Essential for robotics and autonomous systems
- Consider domain: indoor (NYU Depth v2) vs. outdoor (KITTI)
For real-time applications:
- Smaller models (MiDaS-small, Depth Anything-S)
- Lower input resolution (e.g., 256×256 or 384×384)
- Efficient backbones (MobileNet, EfficientNet)
- Optimize with TensorRT or ONNX
- Profile on target hardware
For highest accuracy:
- Large transformer models (Depth Anything-L, DPT-Large)
- High input resolution (512×512 or higher)
- Ensemble multiple models
- Multi-view or stereo methods if possible
- Accept slower inference
For indoor scenes:
- Models trained on indoor datasets (NYU Depth v2)
- Smaller depth ranges, complex layouts
- Fine-tune on similar environments
For outdoor/driving scenes:
- Models trained on KITTI or similar
- Larger depth ranges
- Handle varying lighting and weather
For mobile deployment:
- Lightweight architectures
- Quantization (INT8 or FP16)
- On-device frameworks (TensorFlow Lite, CoreML)
- Balance accuracy and latency
Next Steps
Ready to train or fine-tune depth estimation models? Our Depth Estimation Training Guide provides comprehensive documentation on:
- Available architectures (MiDaS, DPT, Depth Anything)
- Training strategies for metric vs. relative depth
- Dataset preparation and augmentation
- Fine-tuning on custom domains
- Self-supervised training techniques
- Inference optimization and deployment
For understanding related computer vision tasks, see:
- Image Segmentation - Pixel-level semantic understanding
- Object Detection - Localizing objects in images
- Text-to-Image Generation - Using depth for enhanced generation
- Computer Vision Overview - All vision tasks