Image Segmentation
Train models for pixel-level classification and instance segmentation
Image segmentation extends beyond object detection by assigning class labels to every pixel in an image. There are three main types: semantic segmentation (labeling pixels by class), instance segmentation (separating individual object instances), and panoptic segmentation (combining both). These tasks enable precise scene understanding for applications like medical imaging, autonomous driving, and image editing.
Learn About Image Segmentation
New to segmentation? Visit our Image Segmentation Concepts Guide to learn about semantic vs instance segmentation, mask representations, common metrics like IoU and Dice score, and annotation best practices.
Available Models
DETR Segmentation Family
Transformer-based panoptic segmentation extending DETR's object detection capabilities with segmentation masks.
- DETR Segmentation ResNet-101 - Panoptic segmentation with ResNet-101 backbone
- DETR Segmentation ResNet-50 DC5 - Dilated convolutions for better small object segmentation
- DETR Segmentation ResNet-50 - Standard DETR segmentation variant
Foundation Models
Large pre-trained models designed for versatile segmentation with minimal fine-tuning.
- SAM (Segment Anything) - Promptable segmentation for any object with points, boxes, or masks
- Mask R-CNN - Classic instance segmentation extending Faster R-CNN
Semantic Segmentation
Models focused on pixel-level semantic classification without instance separation.
- SegFormer-B0 - Efficient hierarchical transformer for semantic segmentation
Common Configuration
Data Requirements
Training Images: Directory containing your images
Segmentation Masks: Either:
- Folder of masks (for semantic segmentation): PNG/numpy masks where pixel values represent classes
- COCO-format annotations (for instance segmentation): JSON with polygon or RLE masks
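In the semantic format, a mask is simply an integer image whose pixel values are class IDs. A minimal NumPy sketch (toy 3x3 mask; the class labels in the comment are illustrative, not from any real dataset):

```python
import numpy as np

# Toy 3x3 semantic mask: pixel value = class ID
# (0 = background, 1 = road, 2 = vehicle -- labels are illustrative)
mask = np.array([
    [0, 0, 1],
    [0, 2, 2],
    [0, 2, 2],
], dtype=np.uint8)

# Inspect which classes are present and how many pixels each covers
classes, counts = np.unique(mask, return_counts=True)
for cls, n in zip(classes, counts):
    print(f"class {cls}: {n} pixels")
```

A per-class pixel histogram like this is also a quick way to spot class imbalance before training.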
Mask Format Example (Semantic):
train_images/              segmentation_masks/
├── image1.jpg        ->   ├── image1.png  (pixel values = class IDs)
├── image2.jpg        ->   ├── image2.png
└── image3.jpg        ->   └── image3.png
Key Training Parameters
Batch Size: Images processed together
- DETR segmentation: 2-4 (very memory-intensive)
- SAM: Inference only (no training)
- Mask R-CNN: 2-4
- SegFormer: 4-16 depending on variant
Epochs: Training iterations
- 1-10 epochs typical for fine-tuning
- Segmentation often needs more epochs than classification
Learning Rate: Optimization step size
- DETR: 1e-4 (higher than detection due to additional mask head)
- Mask R-CNN: 5e-3 (different optimizer)
- SegFormer: 6e-5 (very small)
Understanding Metrics
IoU (Intersection over Union): Overlap between predicted and ground truth masks
- Primary metric for semantic segmentation
- Calculated per-class then averaged (mIoU)
- Values: 0.0 (no overlap) to 1.0 (perfect match)
Dice Score: Harmonic mean of precision and recall for masks
- Often used in medical imaging
- Monotonically related to IoU (Dice = 2 x IoU / (1 + IoU)), but weights the overlap region more heavily, which matters for small structures
- Formula: 2 x |A ∩ B| / (|A| + |B|)
Pixel Accuracy: Percentage of correctly classified pixels
- Simple to understand but can be misleading
- Dominated by large classes in imbalanced datasets
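The first three metrics can be computed directly from binary masks. A small NumPy sketch on toy predictions (mask values chosen purely for illustration):

```python
import numpy as np

# Toy binary masks: 1 = object, 0 = background
gt   = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]], dtype=bool)

intersection = np.logical_and(gt, pred).sum()      # 2 pixels
union        = np.logical_or(gt, pred).sum()       # 4 pixels

iou  = intersection / union                        # 2 / 4 = 0.5
dice = 2 * intersection / (gt.sum() + pred.sum())  # 4 / 6 ~= 0.667
pixel_acc = (gt == pred).mean()                    # 7 / 9 ~= 0.778

# Note: Dice and IoU are deterministically related: dice == 2*iou/(1+iou)
print(f"IoU={iou:.3f}  Dice={dice:.3f}  PixelAcc={pixel_acc:.3f}")
```

For multi-class semantic segmentation, mIoU repeats the IoU computation per class on one-vs-rest binary masks and averages the results.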
mAP (for instance segmentation): Mean Average Precision of masks
- Same as object detection, but evaluated on mask overlap rather than box overlap
- More stringent than box-based mAP
Choosing the Right Model
By Segmentation Type
Semantic Segmentation (pixel-level classes, no instances)
- SegFormer-B0: Best accuracy-efficiency balance
- DETR Segmentation: If you want transformer-based approach
Instance Segmentation (separate object instances)
- Mask R-CNN: Industry standard, reliable
- DETR Segmentation: Modern transformer alternative
- SAM: For promptable segmentation
Panoptic Segmentation (both semantic + instances)
- DETR Segmentation models: Designed for this
- Combines "stuff" (background) and "things" (countable objects)
By Priority
Maximum Accuracy
- DETR Segmentation ResNet-101 (transformer power)
- Mask R-CNN with ResNet-101 backbone
- SegFormer larger variants
Fastest Training
- SegFormer-B0 (efficient architecture)
- Mask R-CNN (mature optimization)
- DETR variants (slower convergence)
Best for Small Objects
- DETR Segmentation ResNet-50 DC5 (dilated convs)
- Mask R-CNN with FPN
- SegFormer with high resolution
Interactive/Promptable
- SAM (designed for this - inference only)
- Other models need full retraining for new classes
By Use Case
Medical Imaging
- SegFormer-B0 or DETR Segmentation
- High accuracy critical
- Often semantic segmentation sufficient
Autonomous Driving
- DETR Segmentation for panoptic understanding
- Need both road surface (semantic) and vehicles (instance)
- Real-time requirements favor SegFormer
Image Editing/Annotation
- SAM for interactive segmentation
- Promptable approach ideal for user-guided tasks
Industrial Inspection
- Mask R-CNN for instance segmentation
- Reliable, well-tested in production
- Good for segmenting quality-control defects
Best Practices
Data Preparation
Mask Quality: Pixel-perfect masks critical
- Accurate boundaries, no gaps
- Consistent annotation across dataset
- Include ambiguous regions appropriately
Class Balance:
- Balance pixel counts across classes
- Small classes need oversampling or weighted loss
- Background class often dominates - handle carefully
Instance Annotation:
- For instance segmentation, separate touching objects
- Consistent rules for occlusion
- Include partially visible instances if relevant
Resolution Considerations:
- Higher resolution captures fine details
- But requires more memory and compute
- Balance based on object sizes
Training Strategy
Start Conservative: Default hyperparameters are usually a good starting point
Monitor Multiple Metrics:
- IoU/Dice for segmentation quality
- Loss for training progress
- Per-class metrics to identify weak classes
Class Weights:
- Use weighted loss for imbalanced classes
- Emphasize difficult or rare classes
- Prevent background class from dominating
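A class-weighted pixel cross-entropy can be sketched in plain NumPy. The weights, class count, and toy tensors below are illustrative; in practice you would use your framework's built-in weighted loss:

```python
import numpy as np

def weighted_pixel_ce(probs, target, class_weights):
    """Class-weighted cross-entropy averaged over pixels.

    probs:  (H, W, C) softmax probabilities
    target: (H, W) integer class IDs
    class_weights: (C,) per-class weights (rare classes get larger weights)
    """
    h, w = target.shape
    # Probability the model assigned to the true class at each pixel
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    weights = class_weights[target]
    # Weighted mean of per-pixel negative log-likelihood
    return np.sum(weights * -np.log(p_true)) / np.sum(weights)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
target = np.array([[0, 0, 1, 2], [0, 1, 2, 2], [0, 0, 0, 1], [2, 1, 0, 0]])

uniform = weighted_pixel_ce(probs, target, np.ones(3))
# Up-weight class 2 (pretend it is the rare class)
weighted = weighted_pixel_ce(probs, target, np.array([1.0, 1.0, 5.0]))
print(uniform, weighted)
```

Because the weighted sum is normalized by the total weight, scaling all weights by a constant leaves the loss unchanged; only the relative weights between classes matter.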
Augmentation:
- Random crops (ensure objects still visible)
- Flips and rotations (preserve mask-image alignment)
- Color augmentation (doesn't affect masks)
- Avoid transformations that misalign image and mask
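Keeping image and mask aligned means applying the identical geometric transform to both. A minimal NumPy sketch of a paired horizontal flip (the random-draw pattern is illustrative):

```python
import numpy as np

def paired_hflip(image, mask, rng):
    """Horizontally flip image and mask together (or neither)."""
    if rng.random() < 0.5:
        image = image[:, ::-1]   # flip along the width axis
        mask = mask[:, ::-1]     # identical flip keeps alignment
    return image, mask

rng = np.random.default_rng(42)
image = np.arange(12, dtype=np.float32).reshape(3, 4)
mask = (image > 5).astype(np.uint8)   # toy mask derived from the image

aug_img, aug_mask = paired_hflip(image, mask, rng)
# Alignment check: the image-mask relationship survives augmentation
assert np.array_equal(aug_mask, (aug_img > 5).astype(np.uint8))
```

Color jitter, by contrast, is applied to the image only, since it does not move pixels.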
Common Pitfalls
Background Dominance
- Background pixels often 80-90% of dataset
- Solution: Weighted loss, focal loss, or crop around objects
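Focal loss down-weights easy (typically background) pixels so that hard foreground pixels dominate the gradient. A toy NumPy sketch for binary masks (the gamma value is the commonly used default, shown here for illustration):

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Per-pixel focal loss, given the probability assigned to the true class."""
    p_true = np.asarray(p_true, dtype=float)
    # The (1 - p)^gamma factor shrinks the loss for confident predictions
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

# An easy background pixel (model already confident) vs. a hard object pixel
easy = focal_loss(0.9)
hard = focal_loss(0.1)
print(easy, hard)
```

With plain cross-entropy the hard pixel contributes roughly 20x the loss of the easy one; the focal factor widens that gap to over 1000x, which is what keeps abundant background pixels from dominating.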
Boundary Errors
- Predictions often poor at object boundaries
- Solution: Boundary-aware loss, higher resolution, quality masks
Small Object Issues
- Tiny objects easily missed or poorly segmented
- Solution: Use DC5/dilated models, higher resolution, oversampling
Memory Problems
- Segmentation very memory-intensive
- Solution: Smaller batch sizes, gradient accumulation, lower resolution
Inconsistent Masks
- Training masks have inconsistent annotation style
- Solution: Quality control, clear guidelines, re-annotation if needed
Hardware Requirements
Memory Guidelines
Semantic Segmentation:
- 8GB minimum (SegFormer-B0)
- 12-16GB recommended
Instance Segmentation:
- 12GB minimum (Mask R-CNN)
- 16-24GB recommended for DETR variants
Batch Size Impact:
- Segmentation uses 2-4x memory of classification
- Batch size 2-4 typical even with large GPUs
- Consider gradient accumulation for larger effective batches
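Gradient accumulation sums gradients over several small micro-batches and applies a single update, matching the gradient of one larger batch. A framework-free toy sketch with a one-parameter least-squares model (all numbers illustrative):

```python
import numpy as np

# Toy model y = w * x with mean-squared-error loss; the gradient is analytic
def grad(w, x, y):
    return 2.0 * np.mean(x * (w * x - y))

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = 3.0 * x + rng.normal(scale=0.1, size=8)
w = 0.0

# Full batch of 8 vs. accumulation over 4 micro-batches of 2
full = grad(w, x, y)
accum = np.mean([grad(w, x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)])
# Averaging equal-size micro-batch gradients reproduces the full-batch gradient
print(full, accum)
```

In a deep-learning framework the same effect comes from calling backward on each micro-batch without zeroing gradients, then stepping the optimizer once; memory stays at the micro-batch level while the effective batch size grows.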
Training Time Estimates
Per Epoch (5,000 images):
- SegFormer-B0: 30-60 minutes
- Mask R-CNN: 1-2 hours
- DETR Segmentation: 3-5 hours
Times assume RTX 3080/4080 or better
Dataset Size Guidelines
Minimum: 500 images with quality masks
Good: 2,000-5,000 images
Excellent: 10,000+ images
Segmentation generally needs more data than classification or detection due to pixel-level supervision requirements.