Object Detection
Locating and classifying multiple objects within images using bounding boxes
Object detection combines the tasks of localization and classification: it identifies where objects are in an image (with bounding boxes) and what categories they belong to. Unlike image classification, which assigns labels to the entire image, object detection can find multiple objects of different types in a single image.
📚 Training Object Detection Models
Looking to train object detection models? Check out our comprehensive Object Detection Training Guide with detailed parameter documentation for all available models.
What is Object Detection?
Object detection answers two questions simultaneously:
- What objects are in the image? (Classification)
- Where are they located? (Localization)
For each detected object, the model outputs:
- Bounding box coordinates: (x, y, width, height) defining the object's location
- Class label: The object category (e.g., "person", "car", "dog")
- Confidence score: The model's certainty in the detection (0-1)
Examples:
- Autonomous vehicles detecting pedestrians, cars, and traffic signs
- Surveillance systems identifying people and suspicious objects
- Retail analytics counting customers and tracking product interactions
- Medical imaging locating tumors or abnormalities
Key Concepts
Bounding Boxes
Rectangular regions defined by coordinates, with multiple representation formats:
Format variations:
- (x₁, y₁, x₂, y₂): Top-left and bottom-right corners
- (x_center, y_center, width, height): YOLO format
- (x, y, width, height): COCO format with top-left corner
- Normalized vs. absolute: Coordinates as pixels or fractions of image size
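Converting between these formats is a routine preprocessing step. A minimal sketch of two such conversions (the function names are illustrative, not from any particular library):

```python
def xyxy_to_coco(box):
    """(x1, y1, x2, y2) corners -> COCO (x, y, width, height), top-left origin."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def coco_to_yolo(box, img_w, img_h):
    """COCO (x, y, width, height) in pixels -> normalized YOLO (xc, yc, w, h)."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)
```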
Intersection over Union (IoU)
A fundamental metric measuring overlap between predicted and ground-truth boxes:
- IoU = 1.0: Perfect overlap
- IoU = 0.0: No overlap
- IoU ≥ 0.5: Commonly considered a "correct" detection
- Used both for evaluation and during training (matching predictions to ground truth)
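IoU is straightforward to compute for axis-aligned boxes. A minimal sketch for boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Width and height of the intersection rectangle (clamped at zero).
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```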
Anchor Boxes
Predefined boxes of various sizes and aspect ratios used as references:
- Purpose: Simplify the detection problem by predicting offsets from anchors
- Design: Typically chosen based on object statistics in your dataset
- K-means clustering: Common method to determine optimal anchor sizes
- Modern approaches: Anchor-free methods eliminate this requirement
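The clustering idea can be sketched with a naive k-means over (width, height) pairs. Note that YOLO-style pipelines typically cluster with a 1 − IoU distance; plain Euclidean distance is used here only to keep the sketch short:

```python
def kmeans_anchors(whs, k, iters=100):
    """Naive k-means over (width, height) pairs; seeds with the first k boxes."""
    centers = list(whs[:k])
    for _ in range(iters):
        # Assign each box shape to its nearest center (Euclidean distance).
        groups = [[] for _ in range(k)]
        for wh in whs:
            nearest = min(range(k), key=lambda c: (wh[0] - centers[c][0]) ** 2
                                                  + (wh[1] - centers[c][1]) ** 2)
            groups[nearest].append(wh)
        # Recompute each center as the mean of its group.
        centers = [(sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return centers
```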
Non-Maximum Suppression (NMS)
Post-processing to eliminate duplicate detections:
- Sort detections by confidence score
- Keep the highest-scoring detection
- Remove remaining detections whose IoU with the kept detection exceeds a threshold (typically 0.5)
- Repeat for remaining detections
Variants:
- Soft-NMS: Reduces scores instead of removing boxes
- Class-aware NMS: Apply separately per class
- Distance-based NMS: Consider geometric relationships
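The greedy procedure above can be sketched in a few lines (class-agnostic, for boxes in corner format):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns indices of kept boxes."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining detection
        keep.append(best)
        # Drop everything that overlaps the kept detection too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

For class-aware NMS, the same routine is simply run once per class.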
Detection Approaches
Two-Stage Detectors: R-CNN Family
These detectors first propose regions, then classify them:
R-CNN (2014): Regions with CNN features
- Use selective search to generate ~2000 region proposals
- Extract CNN features from each region
- Classify with SVM
- Slow but accurate
Fast R-CNN (2015): Faster processing
- Process entire image with CNN once
- Project region proposals onto feature map
- RoI pooling for fixed-size features
- Single-stage training
Faster R-CNN (2015): Learnable proposals
- Replace selective search with Region Proposal Network (RPN)
- End-to-end trainable
- ~5-10 FPS inference
- Strong baseline for accuracy
Mask R-CNN (2017): Instance segmentation extension
- Adds mask prediction branch
- Used for both detection and segmentation
Cascade R-CNN: Progressive refinement
- Multiple detection heads with increasing IoU thresholds
- Addresses mismatch between training and inference
One-Stage Detectors: Speed-Optimized
These detectors predict directly from feature maps without region proposals:
YOLO (You Only Look Once) series:
- YOLOv1-v3: Pioneered single-shot detection, 30+ FPS
- YOLOv4-v5: Improved accuracy and speed balance
- YOLOv8: Latest with enhanced architecture and training
- Divide image into grid, predict boxes per cell
- Extremely fast, suitable for real-time applications
SSD (Single Shot MultiBox Detector):
- Multi-scale feature maps for different object sizes
- Faster than Faster R-CNN, more accurate than early YOLO
- Good balance of speed and accuracy
RetinaNet:
- Feature Pyramid Network (FPN) for multi-scale detection
- Focal Loss: Addresses class imbalance in one-stage detectors
- Competitive accuracy with two-stage methods
EfficientDet:
- Compound scaling for detection networks
- Weighted bi-directional FPN (BiFPN)
- State-of-the-art efficiency across model sizes
Transformer-Based Detectors
Modern approaches using attention mechanisms:
DETR (Detection Transformer):
- End-to-end detection without hand-designed components
- No NMS or anchor boxes needed
- Bipartite matching loss
- Slower convergence than CNN-based methods
Deformable DETR:
- Addresses DETR's slow convergence and high memory usage
- Deformable attention for efficient multi-scale features
- 10× faster convergence
DINO (DETR with Improved DeNoising Anchor Boxes):
- Enhanced training techniques
- State-of-the-art accuracy
- Better small object detection
Speed vs. Accuracy Trade-offs
Real-time (30+ FPS):
- YOLOv5/v8 (small/medium variants)
- SSD with lightweight backbones
- Use for: Video processing, edge devices, robotics
Balanced (5-15 FPS):
- YOLOv8 (large variants)
- EfficientDet
- RetinaNet
- Use for: General applications, moderate real-time needs
High accuracy (< 5 FPS):
- Cascade R-CNN
- Large DETR variants
- Ensemble methods
- Use for: Offline processing, critical accuracy requirements
Evaluation Metrics
Mean Average Precision (mAP)
The primary metric for object detection, combining precision and recall across classes:
Calculation steps:
- For each class, compute Precision-Recall curve
- Calculate Average Precision (AP) as area under PR curve
- Average AP across all classes to get mAP
Common variants:
- mAP@0.5: IoU threshold of 0.5 (PASCAL VOC metric)
- mAP@0.5:0.95: AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO metric, stricter)
- mAP@0.75: Stricter IoU requirement
mAP = (1/N) · Σᵢ APᵢ, where N is the number of classes and APᵢ is the average precision for class i.
Precision and Recall
Precision: Fraction of detections that are correct
Recall: Fraction of ground-truth objects detected
A detection is a True Positive if:
- Class prediction is correct
- IoU with ground truth ≥ threshold (typically 0.5)
Precision-Recall Curve
Plots precision vs. recall at different confidence thresholds:
- High threshold: Few detections, high precision, low recall
- Low threshold: Many detections, low precision, high recall
- Ideal: Maintains high precision across all recall levels
- AP: Area under this curve
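AP can be computed from sampled (recall, precision) points with all-point interpolation, as in the PASCAL VOC 2010+ protocol (COCO approximates the same area on a fixed 101-point recall grid). A minimal sketch:

```python
def average_precision(recalls, precisions):
    """Area under the PR curve via all-point interpolation.

    `recalls` must be sorted in ascending order, with `precisions` aligned.
    """
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make the precision envelope monotonically non-increasing.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))
```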
Object Size Metrics (COCO)
COCO dataset provides size-specific metrics:
- mAP^small: Objects with area < 32² pixels
- mAP^medium: Objects with 32² ≤ area < 96² pixels
- mAP^large: Objects with area ≥ 96² pixels
Useful for understanding model behavior on different object scales.
Frames Per Second (FPS)
Inference speed metric:
- Measured on specific hardware
- Higher is better for real-time applications
- Trade-off with accuracy
- Consider full pipeline (preprocessing + model + postprocessing)
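A simple way to measure throughput is to time a batch of inferences after a warm-up phase, since the first calls often pay one-time costs (JIT compilation, cache warming, GPU initialization). `infer` below stands in for any callable that runs your full pipeline:

```python
import time

def measure_fps(infer, inputs, warmup=10):
    """End-to-end throughput of `infer` over `inputs`, after a warm-up pass."""
    for x in inputs[:warmup]:   # absorb one-time costs before timing
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed
```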
Annotation Formats
COCO Format (JSON)
Microsoft COCO dataset format, widely used:
{
"images": [{"id": 1, "file_name": "image.jpg", "width": 640, "height": 480}],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [x, y, width, height],
"area": width * height,
"iscrowd": 0
}
],
"categories": [{"id": 1, "name": "person"}]
}
Characteristics:
- Supports segmentation masks in addition to boxes
- (x, y) is top-left corner
- Single JSON file for entire dataset
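Reading this format needs only the standard library. A minimal sketch that indexes annotations by image id (the helper name is illustrative):

```python
import json

def load_coco(path):
    """Load a COCO-style JSON file and group (class name, bbox) pairs per image."""
    with open(path) as f:
        data = json.load(f)
    names = {c["id"]: c["name"] for c in data["categories"]}
    by_image = {img["id"]: [] for img in data["images"]}
    for ann in data["annotations"]:
        # bbox is [x, y, width, height] with (x, y) the top-left corner
        by_image[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))
    return by_image
```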
PASCAL VOC Format (XML)
XML files, one per image:
<annotation>
<filename>image.jpg</filename>
<size><width>640</width><height>480</height></size>
<object>
<name>person</name>
<bndbox>
<xmin>100</xmin><ymin>100</ymin>
<xmax>200</xmax><ymax>300</ymax>
</bndbox>
</object>
</annotation>
Characteristics:
- One XML file per image
- (xmin, ymin, xmax, ymax) format
- Human-readable
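These files parse cleanly with the standard library's ElementTree. A minimal sketch (helper name illustrative):

```python
import xml.etree.ElementTree as ET

def load_voc(path):
    """Parse one PASCAL VOC XML file into (name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        objects.append((
            obj.findtext("name"),
            int(bb.findtext("xmin")), int(bb.findtext("ymin")),
            int(bb.findtext("xmax")), int(bb.findtext("ymax")),
        ))
    return objects
```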
YOLO Format (Text)
Simple text files, one per image:
class_id x_center y_center width height
0 0.5 0.5 0.3 0.4
1 0.7 0.3 0.2 0.25
Characteristics:
- All values normalized to [0, 1]
- One line per object
- Separate classes.txt file lists class names
- Extremely simple and fast to parse
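Converting a YOLO label line back to absolute corner coordinates only requires the image size. A minimal sketch:

```python
def yolo_to_corners(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels."""
    cls, xc, yc, w, h = line.split()
    # Denormalize center and size, then derive the corners.
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
```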
Conversion Tools
Most frameworks provide conversion utilities:
- COCO ↔ YOLO converters widely available
- PASCAL VOC conversion supported in most libraries
- Custom formats can be converted with scripting
Data Requirements
Dataset Size
Depends on task complexity and approach:
- Transfer learning: 100-1000 annotated images (minimum)
- Training from scratch: 10,000+ images recommended
- Fine-tuning: Can work with smaller datasets (50-500 images)
- Complex scenes: More data needed for multi-object scenarios
Annotation Quality
High-quality annotations are critical:
- Tight bounding boxes: Minimize background pixels
- Consistent labeling: Same objects labeled the same way
- Complete annotations: Don't miss objects in crowded scenes
- Handle occlusion: Annotate partially visible objects
- Class definitions: Clear guidelines for edge cases
Data Diversity
Model needs to see varied examples:
- Multiple viewpoints: Front, side, angled views
- Various scales: Near and far objects
- Different lighting: Indoor, outdoor, day, night
- Backgrounds: Cluttered and clean environments
- Occlusion levels: Fully visible and partially hidden objects
Class Balance
Aim for balanced representation:
- Similar number of instances per class
- If imbalanced, use weighted losses or oversampling
- Monitor per-class metrics during training
- Consider combining rare classes
Common Challenges
Small Object Detection
Objects occupying few pixels are hard to detect:
- Problems: Low resolution, little detail, fewer features
- Solutions:
- Multi-scale feature pyramids (FPN)
- Higher input resolution
- Specialized small-object detectors
- Tile-based processing for very high-res images
- Anchors: Design smaller anchor boxes
- Augmentation: Careful with downscaling
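Tile-based processing amounts to generating overlapping crop windows, running the detector on each, and mapping detections back to full-image coordinates (that last step, and NMS across tile boundaries, are omitted here). The default sizes below are illustrative:

```python
def tile_coords(img_w, img_h, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) crop windows covering the image with overlap."""
    step = tile - overlap
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    if xs[-1] + tile < img_w:      # ensure the right edge is covered
        xs.append(img_w - tile)
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    if ys[-1] + tile < img_h:      # ensure the bottom edge is covered
        ys.append(img_h - tile)
    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for x in xs for y in ys]
```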
Occlusion and Crowding
Objects partially hidden or overlapping:
- Problems: Incomplete visual information, ambiguous boundaries
- Solutions:
- Train on occluded examples
- Attention mechanisms to focus on visible parts
- Context modeling to infer hidden portions
- Soft-NMS to preserve overlapping detections
- Annotation: Label even partially visible objects
Class Imbalance
Some classes appear far more frequently:
- Problems: Model biased toward common classes, poor rare class performance
- Solutions:
- Focal loss to down-weight easy examples
- Class-balanced sampling during training
- Weighted losses
- Oversample rare classes or undersample common ones
- Evaluation: Report per-class metrics, not just overall mAP
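Focal loss down-weights well-classified examples by the factor (1 − pₜ)^γ. A minimal per-example binary sketch with the commonly used defaults γ = 2, α = 0.25:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    pt = p if y == 1 else 1 - p          # probability assigned to the true class
    a = alpha if y == 1 else 1 - alpha   # per-class weighting factor
    return -a * (1 - pt) ** gamma * math.log(pt)
```

Compared with plain cross-entropy (−log pₜ), an easy negative such as p = 0.1 contributes almost nothing, while a hard negative such as p = 0.9 keeps most of its loss.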
Speed vs. Accuracy Tradeoff
Real-time requirements conflict with accuracy goals:
- Analysis: Profile your application's speed requirements
- Solutions:
- Choose appropriate architecture for your use case
- Model compression: Quantization, pruning
- Hardware acceleration: TensorRT, ONNX Runtime
- Resolution reduction (carefully)
- Testing: Measure actual inference time on target hardware
Background False Positives
Model detects objects where none exist:
- Problems: Cluttered backgrounds, similar patterns
- Solutions:
- Add "background" or "negative" training examples
- Adjust confidence thresholds
- Focal loss reduces focus on easy negatives
- Hard negative mining
- Augmentation: Include challenging background images
Domain Shift
Performance drops in new environments:
- Problems: Different camera angles, lighting, image quality
- Solutions:
- Include diverse training data
- Domain adaptation techniques
- Test-time augmentation
- Fine-tune on target domain
- Validation: Test on data from deployment environment
Practical Applications
Autonomous Vehicles
- Pedestrian and vehicle detection
- Traffic sign and signal recognition
- Lane and road boundary detection
- Obstacle identification
Surveillance and Security
- Person detection and tracking
- Suspicious object identification
- Crowd monitoring
- Perimeter intrusion detection
Retail Analytics
- Customer counting and tracking
- Product recognition on shelves
- Queue length monitoring
- Inventory management
Manufacturing and Quality Control
- Defect detection on production lines
- Part identification and counting
- Assembly verification
- Safety equipment detection (hard hats, vests)
Medical Imaging
- Tumor detection in CT/MRI scans
- Organ localization
- Anatomical landmark detection
- Cell counting in microscopy
Agriculture
- Crop disease detection
- Fruit counting for yield estimation
- Weed detection
- Livestock monitoring
Wildlife Conservation
- Animal detection in camera traps
- Species identification
- Population counting
- Poaching detection
Augmented Reality
- Object recognition for AR overlays
- Spatial understanding
- Hand and gesture detection
- Face and facial landmark detection
Choosing an Approach
Consider these factors when selecting a detection method:
For real-time applications (robotics, video analytics):
- Use one-stage detectors: YOLOv8, SSD
- Accept slightly lower accuracy for speed
- Optimize for your specific hardware
- Consider edge-optimized variants (YOLO-nano, MobileNet-SSD)
For highest accuracy (medical, critical safety):
- Use two-stage detectors: Cascade R-CNN, Mask R-CNN
- Or large transformer models: DINO, Deformable DETR
- Can afford slower inference
- Ensemble multiple models if needed
For small objects (aerial imagery, microscopy):
- Feature pyramid networks essential
- Higher input resolution
- Specialized architectures like Cascade R-CNN
- Consider tile-based processing
For limited data:
- Transfer learning from COCO-pretrained models
- Data augmentation critical
- Consider few-shot detection methods
- Active learning to prioritize annotation
For edge deployment (mobile, embedded):
- Lightweight architectures: MobileNet-SSD, YOLO-nano
- Model quantization and pruning
- Profile on actual target hardware
- Optimize preprocessing pipeline
Next Steps
Ready to train your own object detection models? Our Object Detection Training Guide provides comprehensive documentation on:
- Available architectures (YOLO, Faster R-CNN, RetinaNet, etc.)
- Hyperparameter configuration
- Data preparation and augmentation strategies
- Training optimization and debugging
For understanding related computer vision tasks, see:
- Image Classification - Single label per image
- Image Segmentation - Pixel-level understanding
- Computer Vision Overview - All vision tasks