
Zero-Shot Image Classification

Train models to classify images into novel categories without explicit training examples

Zero-shot image classification represents a paradigm shift in computer vision, enabling models to recognize and classify images into categories they have never seen during training. Unlike traditional classification that requires hundreds of labeled examples per class, zero-shot learning leverages semantic relationships, learned representations, and few-shot episodic training to generalize to entirely new classes. This approach is revolutionary for applications with rare categories, rapidly evolving taxonomies, long-tail distributions, or scenarios where collecting training data is expensive or impractical.

Learn About Zero-Shot Image Classification

New to zero-shot learning? Visit our Zero-Shot Image Classification Concepts Guide to learn about few-shot learning, metric learning, prototypical networks, episodic training, support/query sets, and N-way K-shot evaluation.

Available Models

Metric Learning-Based Models

Learn embedding spaces where similar images cluster together, enabling classification through distance metrics.

Common Configuration

Data Requirements

Training Images: Directory containing training images organized by class

  • Base classes: Categories used during meta-training
  • Each class in its own subfolder
  • Minimum 20-50 images per class
  • Multiple classes needed (20+ recommended)

Episodic Training Structure: Unlike traditional classification, zero-shot models learn through episodes:

  • Each episode samples N classes (N-way)
  • K support examples per class (K-shot)
  • Q query examples to classify
  • Model learns from support set, evaluated on query set
  • Thousands of episodes during training
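The episode structure above can be sketched in a few lines (a minimal illustration; the dictionary layout and function name are assumptions, not part of any particular API):

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=5, n_query=10):
    """Sample one N-way K-shot episode from a {class_name: [image, ...]} mapping."""
    classes = random.sample(sorted(images_by_class), n_way)   # N classes per episode
    support, query = {}, {}
    for cls in classes:
        picks = random.sample(images_by_class[cls], k_shot + n_query)
        support[cls] = picks[:k_shot]   # K labeled examples the model adapts from
        query[cls] = picks[k_shot:]     # Q examples it must classify
    return support, query

# Toy usage: class names mapped to lists of image identifiers
data = {f"class_{i}": [f"img_{i}_{j}" for j in range(30)] for i in range(20)}
support, query = sample_episode(data, n_way=5, k_shot=5, n_query=10)
```

During meta-training this sampling is repeated thousands of times, with a loss computed on each episode's query set.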

Directory Structure Example:

train_images/
├── class_1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── class_2/
│   ├── image1.jpg
│   └── ...
└── class_N/
    └── ...
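A loader for this layout might index classes by subfolder (a pathlib-based sketch; the accepted extensions and the warning threshold, taken from the minimum above, are assumptions):

```python
from pathlib import Path

def index_dataset(root):
    """Map each class subfolder under root to its sorted list of image paths."""
    root = Path(root)
    images_by_class = {
        d.name: sorted(p for p in d.iterdir()
                       if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
        for d in sorted(root.iterdir()) if d.is_dir()
    }
    # Flag classes below the recommended minimum of ~20 images
    for cls, paths in images_by_class.items():
        if len(paths) < 20:
            print(f"warning: {cls} has only {len(paths)} images")
    return images_by_class
```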

Novel Classes at Inference:

  • Provide few examples (1-5 shots) of new classes
  • Model classifies test images into novel categories
  • No retraining required

Key Training Parameters

Epochs: Number of meta-training iterations

  • 1-5 epochs typical for episodic training
  • Each epoch contains many episodes
  • More epochs for complex domains
  • Convergence typically faster than standard classification

Learning Rate: Optimizer step size for embedding learning

  • 0.001 typical starting point
  • Lower (0.0001) for fine-tuning existing models
  • Higher (0.01) for training from scratch
  • Metric learning sensitive to learning rate

Eval Steps: Evaluation frequency during training

  • 1 for epoch-level evaluation
  • More frequent for large datasets
  • Evaluates generalization to novel classes

Number of Ways (N): Classes per episode during training

  • 5-way typical (5 classes per episode)
  • Higher N makes task harder but improves generalization
  • Should match expected inference scenario

Number of Shots (K): Support examples per class

  • 1-shot: Most challenging, best generalization
  • 5-shot: Balanced difficulty
  • 10-shot: Easier, more stable prototypes
  • Train on various K for flexibility

Number of Query: Query examples per episode

  • 5-15 typical
  • More queries provide better gradient estimates
  • Balance with computational cost

Understanding Metrics

Accuracy: Primary metric for zero-shot classification

  • Percentage of correctly classified query images
  • Measured on novel classes not seen in meta-training
  • Higher is better, ranges from 0 to 1 (or 0% to 100%)
  • Compare against random baseline (1/N for N-way)

N-Way K-Shot Accuracy: Standard evaluation format

  • Example: 5-way 1-shot accuracy = 75%
  • Means: Given 5 novel classes with 1 example each, model correctly classifies 75% of test images
  • Different N/K provide different difficulty levels
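Reported N-way K-shot numbers are typically the mean query accuracy over many evaluation episodes, often with a 95% confidence interval. A small sketch of that aggregation (the per-episode accuracies here are toy values):

```python
import math
import statistics

def summarize_episodes(accuracies):
    """Mean accuracy and 95% confidence half-width over evaluation episodes."""
    mean = statistics.mean(accuracies)
    # Standard error of the mean; 1.96 is the normal-approximation 95% factor
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * sem

accs = [0.72, 0.80, 0.68, 0.76, 0.74]   # toy per-episode query accuracies
mean, ci = summarize_episodes(accs)
print(f"5-way accuracy: {mean:.1%} ± {ci:.1%}")
```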

Confusion Matrix: Per-class performance analysis

  • Shows which classes are confused with each other
  • Useful for identifying similar categories
  • Helps debug poor performance

Embedding Quality Metrics:

  • Intra-class distance: How tightly examples of same class cluster
  • Inter-class distance: How separated different classes are
  • A high inter- to intra-class distance ratio indicates embeddings that separate classes well
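Both quantities can be computed directly from a batch of embeddings (a NumPy sketch; the function name and the ratio convention are illustrative):

```python
import numpy as np

def embedding_quality(embeddings, labels):
    """Mean intra-class and inter-class Euclidean distances, plus their ratio."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dist[same & off_diag].mean()   # tight is good
    inter = dist[~same].mean()             # far apart is good
    return intra, inter, inter / intra
```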

Loss Metrics:

  • Prototypical loss: Distance-based loss in embedding space
  • Should decrease during training
  • Convergence indicates good embeddings learned
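The prototypical loss is the cross-entropy of a softmax over negative squared distances to the class prototypes. A NumPy sketch of one episode's loss (the embedding arrays stand in for encoder outputs):

```python
import numpy as np

def prototypical_loss(support_emb, support_lbl, query_emb, query_lbl):
    """Cross-entropy over softmax of -squared distance to class prototypes."""
    classes = np.unique(support_lbl)
    # Prototype = mean support embedding per class
    protos = np.stack([support_emb[support_lbl == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each query to each prototype
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_lbl)        # true-class column per query
    return -log_p[np.arange(len(query_lbl)), idx].mean()
```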

Choosing the Right Model

By Application Scenario

Rare or Emerging Categories

  • Prototypical Network ideal
  • New classes appear frequently
  • Examples: New product types, emerging species, novel diseases
  • Benefits from few-shot capability

Long-Tail Distribution

  • Many classes with few examples
  • Traditional classification impractical
  • Examples: Species recognition (rare animals), specialized medical conditions
  • Zero-shot handles rare classes naturally

Rapid Taxonomy Changes

  • Classification scheme evolves frequently
  • Retraining expensive or slow
  • Examples: Fashion trends, news categorization, dynamic product catalogs
  • Add new classes without retraining

Data Collection Expensive

  • Labeling costly or time-consuming
  • Expert knowledge required
  • Examples: Medical imaging, scientific research, specialized domains
  • Minimize labeling effort through few-shot

Personalization

  • User-specific categories
  • Each user defines their own classes
  • Examples: Personal photo organization, custom object recognition
  • Deploy same model for all users, customize with few examples

By Data Availability

Many Base Classes, Rich Data (50+ classes, 1,000+ images each)

  • Excellent for meta-training
  • Strong transferable representations
  • Expect high accuracy on novel classes
  • Can handle challenging few-shot scenarios (1-shot)

Moderate Base Classes (20-50 classes, 500+ images each)

  • Good for meta-training
  • Reasonable generalization
  • May prefer 5-shot over 1-shot
  • Acceptable performance

Limited Base Classes (<20 classes)

  • Challenging for meta-training
  • May need more shots at inference (5-10)
  • Consider transfer learning from pre-trained model
  • Generalization limited

Novel Class Similarity

  • If novel classes very different from base classes: Harder
  • If novel classes similar to base classes: Easier
  • Domain match between training and novel classes important

Best Practices

Data Preparation

  1. Diverse Base Classes: Meta-training needs variety

    • Wide range of visual concepts
    • Different object types, textures, shapes
    • Avoid using only very similar classes
    • Diversity enables generalization
  2. Sufficient Examples per Class:

    • Minimum 20 images per class for meta-training
    • 50-100 images ideal
    • Quality matters more than quantity
    • Ensure class purity (no mislabeled images)
  3. Balanced Class Distribution:

    • Similar number of images per class preferred
    • Extreme imbalance can bias learning
    • If imbalanced, consider weighted sampling
  4. Image Quality and Consistency:

    • Consistent image quality across classes
    • Similar resolution and aspect ratios
    • Clean backgrounds helpful initially
    • Augmentation can add variety
  5. Separate Validation Classes:

    • Hold out some base classes for validation
    • Simulates novel class scenario
    • Never let validation classes leak into training
    • Essential for honest evaluation

Training Strategy

  1. Start with Pre-trained Embeddings: If available

    • Transfer learning accelerates meta-training
    • Better initial representations
    • Especially important with limited base classes
  2. Episode Configuration:

    • Start with 5-way 5-shot during meta-training
    • Gradually increase difficulty (more ways, fewer shots)
    • Match training episodes to expected inference scenario
  3. Monitor Validation Performance:

    • Evaluate on held-out novel classes
    • Check if accuracy plateaus
    • Compare to random baseline (20% for 5-way)
    • Ensure no overfitting to base classes
  4. Learning Rate Scheduling:

    • Start with standard rate (0.001)
    • Reduce if loss oscillates
    • Consider cosine annealing or step decay
    • Metric learning benefits from careful tuning
  5. Data Augmentation:

    • Moderate augmentation recommended
    • Rotation, flip, color jitter
    • Avoid augmentation that changes semantics
    • Helps generalization to novel classes
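The cosine annealing schedule mentioned in step 4 can be written as a one-line formula (a standard form, not a specific trainer's API; the default rates match the values suggested above):

```python
import math

def cosine_annealing(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Decay the learning rate from lr_max to lr_min along a half cosine."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```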

Common Pitfalls

Overfitting to Base Classes

  • Model memorizes base classes instead of learning transferable features
  • Symptoms: High accuracy on base classes, poor on novel classes
  • Solutions: More diverse base classes, regularization, fewer epochs

Insufficient Base Class Diversity

  • Novel classes too different from base classes
  • Symptoms: Poor generalization to novel domains
  • Solutions: Expand base class coverage, use pre-trained models

Inappropriate Episode Configuration

  • Training on 5-way 5-shot but testing 20-way 1-shot
  • Symptoms: Train/test mismatch, poor performance
  • Solutions: Match training episodes to deployment scenario

Class Imbalance

  • Some classes over-represented in episodes
  • Symptoms: Bias toward common classes
  • Solutions: Balanced episode sampling, class weighting

Poor Support Set Selection

  • Support examples not representative of class
  • Symptoms: Misclassification of query images
  • Solutions: Choose diverse, canonical support examples

Confusing Similar Classes

  • Novel classes visually very similar
  • Symptoms: High confusion between specific pairs
  • Solutions: More discriminative embeddings, more shots, better support examples

GPU Requirements

Memory Guidelines

Prototypical Network:

  • 4-8GB GPU sufficient for most configurations
  • Memory depends on batch size and image resolution
  • Episode-based training relatively memory-efficient
  • Can train on consumer GPUs

Typical Memory Usage:

  • 5-way 5-shot with batch size 4: ~4GB
  • 10-way 10-shot with batch size 2: ~6GB
  • Larger images or bigger batches need more memory

Training Time Estimates

Small Dataset (20 classes, 500 images/class):

  • 1 epoch: 30-60 minutes
  • 5 epochs: 2-5 hours
  • Episodes per epoch: ~1,000

Medium Dataset (50 classes, 1,000 images/class):

  • 1 epoch: 1-2 hours
  • 5 epochs: 5-10 hours
  • Episodes per epoch: ~5,000

Large Dataset (100+ classes, 2,000+ images/class):

  • 1 epoch: 3-6 hours
  • 5 epochs: 15-30 hours
  • Episodes per epoch: ~10,000+

Times assume modern GPU (RTX 3070/4070 or better)

Meta-training convergence typically faster than standard classification training.

Dataset Size Guidelines

  • Minimum: 15-20 base classes with 20+ images each
  • Good: 30-50 base classes with 50-100 images each
  • Excellent: 50-100+ base classes with 100+ images each

More diverse base classes lead to better generalization to novel classes.

Inference Workflow

Using Few-Shot Learning

  1. Prepare Support Set:

    • Collect 1-10 examples per novel class
    • Choose representative, high-quality images
    • More shots improve accuracy but increase computation
  2. Run Classification:

    • Model embeds support examples into feature space
    • Computes prototype (centroid) for each novel class
    • Embeds query image
    • Classifies based on nearest prototype
  3. Interpret Results:

    • Predicted class label
    • Confidence score (distance to prototype)
    • Can output top-K predictions
  4. Update Support Set:

    • Add/remove examples as needed
    • Refine prototypes for better accuracy
    • No retraining required

Performance Expectations

1-Shot (1 example per novel class):

  • Most challenging scenario
  • 50-70% accuracy typical (for 5-way)
  • Best for rapid adaptation
  • Sensitive to support example quality

5-Shot (5 examples per novel class):

  • Balanced scenario
  • 70-85% accuracy typical (for 5-way)
  • More robust prototypes
  • Recommended starting point

10-Shot (10 examples per novel class):

  • Easier scenario
  • 80-90% accuracy typical (for 5-way)
  • Very stable prototypes
  • Approaching few-shot upper bound

Accuracy degrades with more ways (more classes to distinguish):

  • 5-way 1-shot: 60%
  • 10-way 1-shot: 45%
  • 20-way 1-shot: 30%

Advanced Considerations

Domain Adaptation

  • Fine-tune on few examples from target domain
  • Helps when novel classes from different distribution
  • Few epochs sufficient
  • Maintains zero-shot capability

Class Hierarchies

  • Leverage taxonomic relationships
  • Coarse-to-fine classification
  • Helps with fine-grained distinctions
  • Improves interpretability

Active Learning

  • Model requests labels for most informative examples
  • Minimizes labeling effort
  • Prioritizes uncertain or boundary cases
  • Efficient data collection

Multi-Modal Learning

  • Combine vision with text descriptions
  • Semantic embeddings enable true zero-shot
  • Use class names or attributes
  • More flexible than few-shot alone

Continual Learning

  • Add novel classes incrementally
  • Avoid catastrophic forgetting
  • Update prototypes without full retraining
  • Scalable to many classes
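Updating prototypes without retraining can be done with a per-class running mean (a minimal sketch; the class and store names are illustrative):

```python
import numpy as np

class PrototypeStore:
    """Per-class prototypes refined example-by-example, with no retraining."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def add(self, label, embedding):
        emb = np.asarray(embedding, dtype=float)
        self.sums[label] = self.sums.get(label, 0.0) + emb
        self.counts[label] = self.counts.get(label, 0) + 1

    def prototype(self, label):
        # Running mean of all embeddings seen for this class so far
        return self.sums[label] / self.counts[label]
```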

Embedding Visualization

  • t-SNE or UMAP of learned embeddings
  • Verify good clustering and separation
  • Debug poor performance
  • Communicate model behavior

Comparison with Traditional Classification

When to Use Zero-Shot Learning

Advantages:

  • No retraining needed for new classes
  • Minimal examples required (1-10 vs 100s)
  • Handles long-tail distributions naturally
  • Rapid adaptation to novel categories
  • Lower data collection costs

Disadvantages:

  • Lower absolute accuracy than fully supervised
  • Requires diverse base classes for meta-training
  • More complex training procedure (episodic)
  • Less mature tooling and resources

When to Use Traditional Classification

Traditional is Better When:

  • Fixed, known set of classes
  • Abundant labeled data available (>100 examples/class)
  • Maximum accuracy critical
  • Classes very similar (fine-grained)
  • Simpler training preferred

Zero-Shot is Better When:

  • Classes evolve or emerge frequently
  • Limited examples per class (<50)
  • Many rare classes (long-tail)
  • Data collection expensive
  • Need rapid deployment for new categories
