Object Detection
Locating and classifying multiple objects within images using bounding boxes
Object detection combines the tasks of localization and classification: it identifies where objects are in an image (with bounding boxes) and what categories they belong to. Unlike image classification, which assigns labels to the entire image, object detection can find multiple objects of different types in a single image.
📚 Training Object Detection Models
Looking to train object detection models? Check out our comprehensive Object Detection Training Guide with detailed parameter documentation for all available models.
What is Object Detection?
Object detection answers two questions simultaneously:
- What objects are in the image? (Classification)
- Where are they located? (Localization)
For each detected object, the model outputs:
- Bounding box coordinates: (x, y, width, height) defining the object's location
- Class label: The object category (e.g., "person", "car", "dog")
- Confidence score: The model's certainty in the detection (0-1)
Examples:
- Autonomous vehicles detecting pedestrians, cars, and traffic signs
- Surveillance systems identifying people and suspicious objects
- Retail analytics counting customers and tracking product interactions
- Medical imaging locating tumors or abnormalities
Key Concepts
Bounding Boxes
Rectangular regions defined by coordinates, with multiple representation formats:
Format variations:
- (x₁, y₁, x₂, y₂): Top-left and bottom-right corners
- (x_center, y_center, width, height): YOLO format
- (x, y, width, height): COCO format with top-left corner
- Normalized vs. absolute: Coordinates as pixels or fractions of image size
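Converting between these formats is a routine preprocessing step. A minimal sketch of two such conversions (the function names are illustrative, not from any particular library):

```python
def xyxy_to_coco(box):
    """(x1, y1, x2, y2) corners -> COCO (x, y, width, height), top-left origin."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def coco_to_yolo(box, img_w, img_h):
    """COCO (x, y, width, height) in pixels -> normalized YOLO (xc, yc, w, h)."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)
```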
Intersection over Union (IoU)
A fundamental metric measuring overlap between predicted and ground-truth boxes:
- IoU = 1.0: Perfect overlap
- IoU = 0.0: No overlap
- IoU ≥ 0.5: Commonly considered a "correct" detection
- Used both for evaluation and during training (matching predictions to ground truth)
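IoU is straightforward to compute for axis-aligned boxes. A minimal sketch for boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Width and height of the intersection rectangle (clamped at zero).
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```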
Anchor Boxes
Predefined boxes of various sizes and aspect ratios used as references:
- Purpose: Simplify the detection problem by predicting offsets from anchors
- Design: Typically chosen based on object statistics in your dataset
- K-means clustering: Common method to determine optimal anchor sizes
- Modern approaches: Anchor-free methods eliminate this requirement
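The clustering idea can be sketched with a naive k-means over (width, height) pairs. Note that YOLO-style pipelines typically cluster with a 1 − IoU distance; plain Euclidean distance is used here only to keep the sketch short:

```python
def kmeans_anchors(whs, k, iters=100):
    """Naive k-means over (width, height) pairs; seeds with the first k boxes."""
    centers = list(whs[:k])
    for _ in range(iters):
        # Assign each box shape to its nearest center (Euclidean distance).
        groups = [[] for _ in range(k)]
        for wh in whs:
            nearest = min(range(k), key=lambda c: (wh[0] - centers[c][0]) ** 2
                                                  + (wh[1] - centers[c][1]) ** 2)
            groups[nearest].append(wh)
        # Recompute each center as the mean of its group.
        centers = [(sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return centers
```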
Non-Maximum Suppression (NMS)
Post-processing to eliminate duplicate detections:
- Sort detections by confidence score
- Keep the highest-scoring detection
- Remove remaining detections whose IoU with the kept detection exceeds a threshold (typically 0.5)
- Repeat for remaining detections
Variants:
- Soft-NMS: Reduces scores instead of removing boxes
- Class-aware NMS: Apply separately per class
- Distance-based NMS: Consider geometric relationships
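The greedy procedure above can be sketched in a few lines (class-agnostic, for boxes in corner format):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns indices of kept boxes."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining detection
        keep.append(best)
        # Drop everything that overlaps the kept detection too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

For class-aware NMS, the same routine is simply run once per class.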
Detection Approaches
Two-Stage Detectors: R-CNN Family
These detectors first propose regions, then classify them:
R-CNN (2014): Regions with CNN features
- Use selective search to generate ~2000 region proposals
- Extract CNN features from each region
- Classify with SVM
- Slow but accurate
Fast R-CNN (2015): Faster processing
- Process entire image with CNN once
- Project region proposals onto feature map
- RoI pooling for fixed-size features
- Single-stage training
Faster R-CNN (2015): Learnable proposals
- Replace selective search with Region Proposal Network (RPN)
- End-to-end trainable
- ~5-10 FPS inference
- Strong baseline for accuracy
Mask R-CNN (2017): Instance segmentation extension
- Adds mask prediction branch
- Used for both detection and segmentation
Cascade R-CNN: Progressive refinement
- Multiple detection heads with increasing IoU thresholds
- Addresses mismatch between training and inference
One-Stage Detectors: Speed-Optimized
These detectors predict directly from feature maps without region proposals:
YOLO (You Only Look Once) series:
- YOLOv1-v3: Pioneered single-shot detection, 30+ FPS
- YOLOv4-v5: Improved accuracy and speed balance
- YOLOv8: Latest with enhanced architecture and training
- Divide image into grid, predict boxes per cell
- Extremely fast, suitable for real-time applications
SSD (Single Shot MultiBox Detector):
- Multi-scale feature maps for different object sizes
- Faster than Faster R-CNN, more accurate than early YOLO
- Good balance of speed and accuracy
RetinaNet:
- Feature Pyramid Network (FPN) for multi-scale detection
- Focal Loss: Addresses class imbalance in one-stage detectors
- Competitive accuracy with two-stage methods
EfficientDet:
- Compound scaling for detection networks
- Weighted bi-directional FPN (BiFPN)
- State-of-the-art efficiency across model sizes
Transformer-Based Detectors
Modern approaches using attention mechanisms:
DETR (Detection Transformer):
- End-to-end detection without hand-designed components
- No NMS or anchor boxes needed
- Bipartite matching loss
- Slower convergence than CNN-based methods
Deformable DETR:
- Addresses DETR's slow convergence and high memory usage
- Deformable attention for efficient multi-scale features
- 10× faster convergence
DINO (DETR with Improved DeNoising Anchor Boxes):
- Enhanced training techniques
- State-of-the-art accuracy
- Better small object detection
Speed vs. Accuracy Trade-offs
Real-time (30+ FPS):
- YOLOv5/v8 (small/medium variants)
- SSD with lightweight backbones
- Use for: Video processing, edge devices, robotics
Balanced (5-15 FPS):
- YOLOv8 (large variants)
- EfficientDet
- RetinaNet
- Use for: General applications, moderate real-time needs
High accuracy (< 5 FPS):
- Cascade R-CNN
- Large DETR variants
- Ensemble methods
- Use for: Offline processing, critical accuracy requirements
Evaluation Metrics
Mean Average Precision (mAP)
The primary metric for object detection, combining precision and recall across classes:
Calculation steps:
- For each class, compute Precision-Recall curve
- Calculate Average Precision (AP) as area under PR curve
- Average AP across all classes to get mAP
Common variants:
- mAP@0.5: IoU threshold of 0.5 (PASCAL VOC metric)
- mAP@0.5:0.95: AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO metric, stricter)
- mAP@0.75: Stricter IoU requirement
mAP = (1/N) · Σᵢ APᵢ, where N is the number of classes and APᵢ is the average precision for class i.
Precision and Recall
Precision: Fraction of detections that are correct
Recall: Fraction of ground-truth objects detected
A detection is a True Positive if:
- Class prediction is correct
- IoU with ground truth ≥ threshold (typically 0.5)
Precision-Recall Curve
Plots precision vs. recall at different confidence thresholds:
- High threshold: Few detections, high precision, low recall
- Low threshold: Many detections, low precision, high recall
- Ideal: Maintains high precision across all recall levels
- AP: Area under this curve
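AP can be computed from sampled (recall, precision) points with all-point interpolation, as in the PASCAL VOC 2010+ protocol (COCO approximates the same area on a fixed 101-point recall grid). A minimal sketch:

```python
def average_precision(recalls, precisions):
    """Area under the PR curve via all-point interpolation.

    `recalls` must be sorted in ascending order, with `precisions` aligned.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make the precision envelope monotonically non-increasing.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))
```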
Object Size Metrics (COCO)
COCO dataset provides size-specific metrics:
- mAP^small: Objects with area < 32² pixels
- mAP^medium: Objects with 32² ≤ area < 96² pixels
- mAP^large: Objects with area ≥ 96² pixels
Useful for understanding model behavior on different object scales.
Frames Per Second (FPS)
Inference speed metric:
- Measured on specific hardware
- Higher is better for real-time applications
- Trade-off with accuracy
- Consider full pipeline (preprocessing + model + postprocessing)
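A simple way to measure throughput is to time a batch of inferences after a warm-up phase, since the first calls often pay one-time costs (JIT compilation, cache warming, GPU initialization). `infer` below stands in for any callable that runs your full pipeline:

```python
import time

def measure_fps(infer, inputs, warmup=10):
    """End-to-end throughput of `infer` over `inputs`, after a warm-up pass."""
    for x in inputs[:warmup]:   # absorb one-time costs before timing
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed
```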
Annotation Formats
COCO Format (JSON)
Microsoft COCO dataset format, widely used:
{
"images": [{"id": 1, "file_name": "image.jpg", "width": 640, "height": 480}],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [x, y, width, height],
"area": width * height,
"iscrowd": 0
}
],
"categories": [{"id": 1, "name": "person"}]
}
Characteristics:
- Supports segmentation masks in addition to boxes
- (x, y) is top-left corner
- Single JSON file for entire dataset
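Reading this format needs only the standard library. A minimal sketch that indexes annotations by image id (the helper name is illustrative):

```python
import json

def load_coco(path):
    """Load a COCO-style JSON file and group (class name, bbox) pairs per image."""
    with open(path) as f:
        data = json.load(f)
    names = {c["id"]: c["name"] for c in data["categories"]}
    by_image = {img["id"]: [] for img in data["images"]}
    for ann in data["annotations"]:
        # bbox is [x, y, width, height] with (x, y) the top-left corner
        by_image[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))
    return by_image
```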
PASCAL VOC Format (XML)
XML files, one per image:
<annotation>
<filename>image.jpg</filename>
<size><width>640</width><height>480</height></size>
<object>
<name>person</name>
<bndbox>
<xmin>100</xmin><ymin>100</ymin>
<xmax>200</xmax><ymax>300</ymax>
</bndbox>
</object>
</annotation>
Characteristics:
- One XML file per image
- (xmin, ymin, xmax, ymax) format
- Human-readable
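These files parse cleanly with the standard library's ElementTree. A minimal sketch (helper name illustrative):

```python
import xml.etree.ElementTree as ET

def load_voc(path):
    """Parse one PASCAL VOC XML file into (name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        objects.append((
            obj.findtext("name"),
            int(bb.findtext("xmin")), int(bb.findtext("ymin")),
            int(bb.findtext("xmax")), int(bb.findtext("ymax")),
        ))
    return objects
```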
YOLO Format (Text)
Simple text files, one per image:
class_id x_center y_center width height
0 0.5 0.5 0.3 0.4
1 0.7 0.3 0.2 0.25
Characteristics:
- All values normalized to [0, 1]
- One line per object
- Separate classes.txt file lists class names
- Extremely simple and fast to parse
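Converting a YOLO label line back to absolute corner coordinates only requires the image size. A minimal sketch:

```python
def yolo_to_corners(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels."""
    cls, xc, yc, w, h = line.split()
    # Denormalize center and size, then derive the corners.
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
```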
Conversion Tools
Most frameworks provide conversion utilities:
- COCO ↔ YOLO converters widely available
- PASCAL VOC conversion supported in most libraries
- Custom formats can be converted with scripting
Data Requirements
Dataset Size
Depends on task complexity and approach:
- Transfer learning: 100-1000 annotated images (minimum)
- Training from scratch: 10,000+ images recommended
- Fine-tuning: Can work with smaller datasets (50-500 images)
- Complex scenes: More data needed for multi-object scenarios
Annotation Quality
High-quality annotations are critical:
- Tight bounding boxes: Minimize background pixels
- Consistent labeling: Same objects labeled the same way
- Complete annotations: Don't miss objects in crowded scenes
- Handle occlusion: Annotate partially visible objects
- Class definitions: Clear guidelines for edge cases
Data Diversity
Model needs to see varied examples:
- Multiple viewpoints: Front, side, angled views
- Various scales: Near and far objects
- Different lighting: Indoor, outdoor, day, night
- Backgrounds: Cluttered and clean environments
- Occlusion levels: Fully visible and partially hidden objects
Class Balance
Aim for balanced representation:
- Similar number of instances per class
- If imbalanced, use weighted losses or oversampling
- Monitor per-class metrics during training
- Consider combining rare classes
Common Challenges
Small Object Detection
Objects occupying few pixels are hard to detect:
- Problems: Low resolution, little detail, fewer features
- Solutions:
- Multi-scale feature pyramids (FPN)
- Higher input resolution
- Specialized small-object detectors
- Tile-based processing for very high-res images
- Anchors: Design smaller anchor boxes
- Augmentation: Careful with downscaling
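Tile-based processing amounts to generating overlapping crop windows, running the detector on each, and mapping detections back to full-image coordinates (that last step, and NMS across tile boundaries, are omitted here). The default sizes below are illustrative:

```python
def tile_coords(img_w, img_h, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) crop windows covering the image with overlap."""
    step = tile - overlap
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    if xs[-1] + tile < img_w:      # ensure the right edge is covered
        xs.append(img_w - tile)
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    if ys[-1] + tile < img_h:      # ensure the bottom edge is covered
        ys.append(img_h - tile)
    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for x in xs for y in ys]
```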
Occlusion and Crowding
Objects partially hidden or overlapping:
- Problems: Incomplete visual information, ambiguous boundaries
- Solutions:
- Train on occluded examples
- Attention mechanisms to focus on visible parts
- Context modeling to infer hidden portions
- Soft-NMS to preserve overlapping detections
- Annotation: Label even partially visible objects
Class Imbalance
Some classes appear far more frequently:
- Problems: Model biased toward common classes, poor rare class performance
- Solutions:
- Focal loss to down-weight easy examples
- Class-balanced sampling during training
- Weighted losses
- Oversample rare classes or undersample common ones
- Evaluation: Report per-class metrics, not just overall mAP
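Focal loss down-weights well-classified examples by the factor (1 − pₜ)^γ. A minimal per-example binary sketch with the commonly used defaults γ = 2, α = 0.25:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    pt = p if y == 1 else 1 - p          # probability assigned to the true class
    a = alpha if y == 1 else 1 - alpha   # per-class weighting factor
    return -a * (1 - pt) ** gamma * math.log(pt)
```

Compared with plain cross-entropy (−log pₜ), an easy negative such as p = 0.1 contributes almost nothing, while a hard negative such as p = 0.9 keeps most of its loss.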
Speed vs. Accuracy Tradeoff
Real-time requirements conflict with accuracy goals:
- Analysis: Profile your application's speed requirements
- Solutions:
- Choose appropriate architecture for your use case
- Model compression: Quantization, pruning
- Hardware acceleration: TensorRT, ONNX Runtime
- Resolution reduction (carefully)
- Testing: Measure actual inference time on target hardware
Background False Positives
Model detects objects where none exist:
- Problems: Cluttered backgrounds, similar patterns
- Solutions:
- Add "background" or "negative" training examples
- Adjust confidence thresholds
- Focal loss reduces focus on easy negatives
- Hard negative mining
- Augmentation: Include challenging background images
Domain Shift
Performance drops in new environments:
- Problems: Different camera angles, lighting, image quality
- Solutions:
- Include diverse training data
- Domain adaptation techniques
- Test-time augmentation
- Fine-tune on target domain
- Validation: Test on data from deployment environment
Practical Applications
Autonomous Vehicles
- Pedestrian and vehicle detection
- Traffic sign and signal recognition
- Lane and road boundary detection
- Obstacle identification
Surveillance and Security
- Person detection and tracking
- Suspicious object identification
- Crowd monitoring
- Perimeter intrusion detection
Retail Analytics
- Customer counting and tracking
- Product recognition on shelves
- Queue length monitoring
- Inventory management
Manufacturing and Quality Control
- Defect detection on production lines
- Part identification and counting
- Assembly verification
- Safety equipment detection (hard hats, vests)
Medical Imaging
- Tumor detection in CT/MRI scans
- Organ localization
- Anatomical landmark detection
- Cell counting in microscopy
Agriculture
- Crop disease detection
- Fruit counting for yield estimation
- Weed detection
- Livestock monitoring
Wildlife Conservation
- Animal detection in camera traps
- Species identification
- Population counting
- Poaching detection
Augmented Reality
- Object recognition for AR overlays
- Spatial understanding
- Hand and gesture detection
- Face and facial landmark detection
Choosing an Approach
Consider these factors when selecting a detection method:
For real-time applications (robotics, video analytics):
- Use one-stage detectors: YOLOv8, SSD
- Accept slightly lower accuracy for speed
- Optimize for your specific hardware
- Consider edge-optimized variants (YOLO-nano, MobileNet-SSD)
For highest accuracy (medical, critical safety):
- Use two-stage detectors: Cascade R-CNN, Mask R-CNN
- Or large transformer models: DINO, Deformable DETR
- Can afford slower inference
- Ensemble multiple models if needed
For small objects (aerial imagery, microscopy):
- Feature pyramid networks essential
- Higher input resolution
- Specialized architectures like Cascade R-CNN
- Consider tile-based processing
For limited data:
- Transfer learning from COCO-pretrained models
- Data augmentation critical
- Consider few-shot detection methods
- Active learning to prioritize annotation
For edge deployment (mobile, embedded):
- Lightweight architectures: MobileNet-SSD, YOLO-nano
- Model quantization and pruning
- Profile on actual target hardware
- Optimize preprocessing pipeline
Next Steps
Ready to train your own object detection models? Our Object Detection Training Guide provides comprehensive documentation on:
- Available architectures (YOLO, Faster R-CNN, RetinaNet, etc.)
- Hyperparameter configuration
- Data preparation and augmentation strategies
- Training optimization and debugging
For understanding related computer vision tasks, see:
- Image Classification - Single label per image
- Image Segmentation - Pixel-level understanding
- Computer Vision Overview - All vision tasks