ViT Base
Vision Transformer Base model for image classification
Vision Transformer Base (ViT-Base-Patch16-224) splits a 224×224 input image into a grid of 16×16-pixel patches, embeds each patch as a token, and applies transformer self-attention over the resulting sequence to classify the image. It achieves strong accuracy on standard benchmarks and benefits significantly from fine-tuning on domain-specific images.
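The patch-splitting step above can be sketched with plain array reshapes. This is an illustrative example, not the model's actual implementation: a 224×224 RGB image yields a 14×14 grid of 16×16 patches, i.e. 196 tokens of 16×16×3 = 768 raw pixel values each.

```python
import numpy as np

# Illustrative sketch of ViT-Base-Patch16-224 patch splitting:
# 224 / 16 = 14 patches per side -> 14 * 14 = 196 patch tokens.
IMAGE_SIZE, PATCH_SIZE, CHANNELS = 224, 16, 3
GRID = IMAGE_SIZE // PATCH_SIZE  # 14

image = np.random.rand(IMAGE_SIZE, IMAGE_SIZE, CHANNELS)  # stand-in input

# Split height and width into (grid, patch) blocks, then flatten each patch.
patches = (
    image.reshape(GRID, PATCH_SIZE, GRID, PATCH_SIZE, CHANNELS)
    .transpose(0, 2, 1, 3, 4)
    .reshape(GRID * GRID, PATCH_SIZE * PATCH_SIZE * CHANNELS)
)
print(patches.shape)  # (196, 768)
```

Each flattened patch is then linearly projected to the model's hidden dimension before self-attention is applied; the code above stops at the raw patch sequence.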
When to use:
- Custom image category classification after fine-tuning
- Medical imaging, product categorization, quality inspection
Input: Image file (PNG, JPG) + optional fine-tuned checkpoint
Output: Predicted class label and confidence scores
Inference Settings
No dedicated inference-time settings. Classification is deterministic given the loaded checkpoint; the set of output class labels is fixed by the categories the model was fine-tuned on.
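How the output is produced can be sketched as follows. This is a hypothetical example (the label names and logit values are invented, and the real label mapping comes from the fine-tuned checkpoint's configuration): the classification head emits one logit per category, confidence scores are the softmax of those logits, and the predicted label is the argmax.

```python
import numpy as np

# Hypothetical fine-tuned categories and head output (for illustration only).
id2label = {0: "defect", 1: "ok", 2: "rework"}
logits = np.array([2.1, 0.3, -1.2])

# Softmax (shifted by the max for numerical stability) gives confidences.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted = id2label[int(np.argmax(probs))]
confidences = {id2label[i]: float(p) for i, p in enumerate(probs)}
print(predicted)  # "defect"
```

Because there is no sampling step, the same image and checkpoint always yield the same label and scores.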