Image Segmentation
Train models for pixel-level classification and instance segmentation
Image segmentation extends beyond object detection by assigning class labels to every pixel in an image. There are three main types: semantic segmentation (labeling pixels by class), instance segmentation (separating individual object instances), and panoptic segmentation (combining both). These tasks enable precise scene understanding for applications like medical imaging, autonomous driving, and image editing.
Learn About Image Segmentation
New to segmentation? Visit our Image Segmentation Concepts Guide to learn about semantic vs instance segmentation, mask representations, common metrics like IoU and Dice score, and annotation best practices.
Available Models
DETR Segmentation Family
Transformer-based panoptic segmentation extending DETR's object detection capabilities with segmentation masks.
- DETR Segmentation ResNet-101 - Panoptic segmentation with ResNet-101 backbone
- DETR Segmentation ResNet-50 DC5 - Dilated convolutions for better small object segmentation
- DETR Segmentation ResNet-50 - Standard DETR segmentation variant
Foundation Models
Large pre-trained models designed for versatile segmentation with minimal fine-tuning.
- SAM (Segment Anything) - Promptable segmentation for any object with points, boxes, or masks
- Mask R-CNN - Classic instance segmentation extending Faster R-CNN
Semantic Segmentation
Models focused on pixel-level semantic classification without instance separation.
- SegFormer-B0 - Efficient hierarchical transformer for semantic segmentation
Common Configuration
Data Requirements
Training Images: Directory containing your images
Segmentation Masks: Either:
- Folder of masks (for semantic segmentation): PNG/numpy masks where pixel values represent classes
- COCO-format annotations (for instance segmentation): JSON with polygon or RLE masks
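In the semantic format, a mask is simply an integer image whose pixel values are class IDs. A minimal NumPy sketch (toy 3x3 mask; the class labels in the comment are illustrative, not from any real dataset):

```python
import numpy as np

# Toy 3x3 semantic mask: pixel value = class ID
# (0 = background, 1 = road, 2 = vehicle -- labels are illustrative)
mask = np.array([
    [0, 0, 1],
    [0, 2, 2],
    [0, 2, 2],
], dtype=np.uint8)

# Inspect which classes are present and how many pixels each covers
classes, counts = np.unique(mask, return_counts=True)
for cls, n in zip(classes, counts):
    print(f"class {cls}: {n} pixels")
```

A per-class pixel histogram like this is also a quick way to spot class imbalance before training.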
Mask Format Example (Semantic):
train_images/              segmentation_masks/
├── image1.jpg        ->   ├── image1.png  (pixel values = class IDs)
├── image2.jpg        ->   ├── image2.png
└── image3.jpg        ->   └── image3.png
Key Training Parameters
Batch Size: Images processed together
- DETR segmentation: 2-4 (very memory-intensive)
- SAM: Inference only (no training)
- Mask R-CNN: 2-4
- SegFormer: 4-16 depending on variant
Epochs: Training iterations
- 1-10 epochs typical for fine-tuning
- Segmentation often needs more epochs than classification
Learning Rate: Optimization step size
- DETR: 1e-4 (higher than detection due to additional mask head)
- Mask R-CNN: 5e-3 (different optimizer)
- SegFormer: 6e-5 (very small)
Understanding Metrics
IoU (Intersection over Union): Overlap between predicted and ground truth masks
- Primary metric for semantic segmentation
- Calculated per-class then averaged (mIoU)
- Values: 0.0 (no overlap) to 1.0 (perfect match)
Dice Score: Harmonic mean of precision and recall for masks
- Often used in medical imaging
- Monotonically related to IoU (Dice = 2 x IoU / (1 + IoU)), but weights the overlap region more heavily, which matters for small structures
- Formula: 2 x |A ∩ B| / (|A| + |B|)
Pixel Accuracy: Percentage of correctly classified pixels
- Simple to understand but can be misleading
- Dominated by large classes in imbalanced datasets
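The first three metrics can be computed directly from binary masks. A small NumPy sketch on toy predictions (mask values chosen purely for illustration):

```python
import numpy as np

# Toy binary masks: 1 = object, 0 = background
gt   = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]], dtype=bool)

intersection = np.logical_and(gt, pred).sum()      # 2 pixels
union        = np.logical_or(gt, pred).sum()       # 4 pixels

iou  = intersection / union                        # 2 / 4 = 0.5
dice = 2 * intersection / (gt.sum() + pred.sum())  # 4 / 6 ~= 0.667
pixel_acc = (gt == pred).mean()                    # 7 / 9 ~= 0.778

# Note: Dice and IoU are deterministically related: dice == 2*iou/(1+iou)
print(f"IoU={iou:.3f}  Dice={dice:.3f}  PixelAcc={pixel_acc:.3f}")
```

For multi-class semantic segmentation, mIoU repeats the IoU computation per class on one-vs-rest binary masks and averages the results.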
mAP (for instance segmentation): Mean Average Precision of masks
- Same as object detection, but evaluated on mask overlap rather than box overlap
- More stringent than box-based mAP
Choosing the Right Model
By Segmentation Type
Semantic Segmentation (pixel-level classes, no instances)
- SegFormer-B0: Best accuracy-efficiency balance
- DETR Segmentation: If you want transformer-based approach
Instance Segmentation (separate object instances)
- Mask R-CNN: Industry standard, reliable
- DETR Segmentation: Modern transformer alternative
- SAM: For promptable segmentation
Panoptic Segmentation (both semantic + instances)
- DETR Segmentation models: Designed for this
- Combines "stuff" (background) and "things" (countable objects)
By Priority
Maximum Accuracy
- DETR Segmentation ResNet-101 (transformer power)
- Mask R-CNN with ResNet-101 backbone
- SegFormer larger variants
Fastest Training
- SegFormer-B0 (efficient architecture)
- Mask R-CNN (mature optimization)
- DETR variants (slower convergence)
Best for Small Objects
- DETR Segmentation ResNet-50 DC5 (dilated convs)
- Mask R-CNN with FPN
- SegFormer with high resolution
Interactive/Promptable
- SAM (designed for this - inference only)
- Other models need full retraining for new classes
By Use Case
Medical Imaging
- SegFormer-B0 or DETR Segmentation
- High accuracy critical
- Often semantic segmentation sufficient
Autonomous Driving
- DETR Segmentation for panoptic understanding
- Need both road surface (semantic) and vehicles (instance)
- Real-time requirements favor SegFormer
Image Editing/Annotation
- SAM for interactive segmentation
- Promptable approach ideal for user-guided tasks
Industrial Inspection
- Mask R-CNN for instance segmentation
- Reliable, well-tested in production
- Good for segmenting quality-control defects
Best Practices
Data Preparation
Mask Quality: Pixel-perfect masks critical
- Accurate boundaries, no gaps
- Consistent annotation across dataset
- Include ambiguous regions appropriately
Class Balance:
- Balance pixel counts across classes
- Small classes need oversampling or weighted loss
- Background class often dominates - handle carefully
Instance Annotation:
- For instance segmentation, separate touching objects
- Consistent rules for occlusion
- Include partially visible instances if relevant
Resolution Considerations:
- Higher resolution captures fine details
- But requires more memory and compute
- Balance based on object sizes
Training Strategy
Start Conservative: Default hyperparameters are usually a good starting point
Monitor Multiple Metrics:
- IoU/Dice for segmentation quality
- Loss for training progress
- Per-class metrics to identify weak classes
Class Weights:
- Use weighted loss for imbalanced classes
- Emphasize difficult or rare classes
- Prevent background class from dominating
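A class-weighted pixel cross-entropy can be sketched in plain NumPy. The weights, class count, and toy tensors below are illustrative; in practice you would use your framework's built-in weighted loss:

```python
import numpy as np

def weighted_pixel_ce(probs, target, class_weights):
    """Class-weighted cross-entropy averaged over pixels.

    probs:  (H, W, C) softmax probabilities
    target: (H, W) integer class IDs
    class_weights: (C,) per-class weights (rare classes get larger weights)
    """
    h, w = target.shape
    # Probability the model assigned to the true class at each pixel
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    weights = class_weights[target]
    # Weighted mean of per-pixel negative log-likelihood
    return np.sum(weights * -np.log(p_true)) / np.sum(weights)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
target = np.array([[0, 0, 1, 2], [0, 1, 2, 2], [0, 0, 0, 1], [2, 1, 0, 0]])

uniform = weighted_pixel_ce(probs, target, np.ones(3))
# Up-weight class 2 (pretend it is the rare class)
weighted = weighted_pixel_ce(probs, target, np.array([1.0, 1.0, 5.0]))
print(uniform, weighted)
```

Because the weighted sum is normalized by the total weight, scaling all weights by a constant leaves the loss unchanged; only the relative weights between classes matter.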
Augmentation:
- Random crops (ensure objects still visible)
- Flips and rotations (preserve mask-image alignment)
- Color augmentation (doesn't affect masks)
- Avoid transformations that misalign image and mask
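Keeping image and mask aligned means applying the identical geometric transform to both. A minimal NumPy sketch of a paired horizontal flip (the random-draw pattern is illustrative):

```python
import numpy as np

def paired_hflip(image, mask, rng):
    """Horizontally flip image and mask together (or neither)."""
    if rng.random() < 0.5:
        image = image[:, ::-1]   # flip along the width axis
        mask = mask[:, ::-1]     # identical flip keeps alignment
    return image, mask

rng = np.random.default_rng(42)
image = np.arange(12, dtype=np.float32).reshape(3, 4)
mask = (image > 5).astype(np.uint8)   # toy mask derived from the image

aug_img, aug_mask = paired_hflip(image, mask, rng)
# Alignment check: the image-mask relationship survives augmentation
assert np.array_equal(aug_mask, (aug_img > 5).astype(np.uint8))
```

Color jitter, by contrast, is applied to the image only, since it does not move pixels.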
Common Pitfalls
Background Dominance
- Background pixels often 80-90% of dataset
- Solution: Weighted loss, focal loss, or crop around objects
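Focal loss down-weights easy (typically background) pixels so that hard foreground pixels dominate the gradient. A toy NumPy sketch for binary masks (the gamma value is the commonly used default, shown here for illustration):

```python
import numpy as np

def focal_loss(p_true, gamma=2.0):
    """Per-pixel focal loss, given the probability assigned to the true class."""
    p_true = np.asarray(p_true, dtype=float)
    # The (1 - p)^gamma factor shrinks the loss for confident predictions
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

# An easy background pixel (model already confident) vs. a hard object pixel
easy = focal_loss(0.9)
hard = focal_loss(0.1)
print(easy, hard)
```

With plain cross-entropy the hard pixel contributes roughly 20x the loss of the easy one; the focal factor widens that gap to over 1000x, which is what keeps abundant background pixels from dominating.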
Boundary Errors
- Predictions often poor at object boundaries
- Solution: Boundary-aware loss, higher resolution, quality masks
Small Object Issues
- Tiny objects easily missed or poorly segmented
- Solution: Use DC5/dilated models, higher resolution, oversampling
Memory Problems
- Segmentation very memory-intensive
- Solution: Smaller batch sizes, gradient accumulation, lower resolution
Inconsistent Masks
- Training masks have inconsistent annotation style
- Solution: Quality control, clear guidelines, re-annotation if needed
Hardware Requirements
Memory Guidelines
Semantic Segmentation:
- 8GB minimum (SegFormer-B0)
- 12-16GB recommended
Instance Segmentation:
- 12GB minimum (Mask R-CNN)
- 16-24GB recommended for DETR variants
Batch Size Impact:
- Segmentation uses 2-4x memory of classification
- Batch size 2-4 typical even with large GPUs
- Consider gradient accumulation for larger effective batches
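Gradient accumulation sums gradients over several small micro-batches and applies a single update, matching the gradient of one larger batch. A framework-free toy sketch with a one-parameter least-squares model (all numbers illustrative):

```python
import numpy as np

# Toy model y = w * x with mean-squared-error loss; the gradient is analytic
def grad(w, x, y):
    return 2.0 * np.mean(x * (w * x - y))

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = 3.0 * x + rng.normal(scale=0.1, size=8)
w = 0.0

# Full batch of 8 vs. accumulation over 4 micro-batches of 2
full = grad(w, x, y)
accum = np.mean([grad(w, x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)])
# Averaging equal-size micro-batch gradients reproduces the full-batch gradient
print(full, accum)
```

In a deep-learning framework the same effect comes from calling backward on each micro-batch without zeroing gradients, then stepping the optimizer once; memory stays at the micro-batch level while the effective batch size grows.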
Training Time Estimates
Per Epoch (5,000 images):
- SegFormer-B0: 30-60 minutes
- Mask R-CNN: 1-2 hours
- DETR Segmentation: 3-5 hours
Times assume RTX 3080/4080 or better
Dataset Size Guidelines
Minimum: 500 images with quality masks
Good: 2,000-5,000 images
Excellent: 10,000+ images
Segmentation generally needs more data than classification or detection due to pixel-level supervision requirements.