CLIP ViT-L/14
Joint vision-language embedding model for image similarity and zero-shot tasks
CLIP ViT-L/14 from OpenAI encodes images into an embedding space shared with text, enabling zero-shot classification and cross-modal retrieval without labelled data.
When to use:
- Image similarity search (find visually similar images)
- Zero-shot image classification without labelled data
- Cross-modal retrieval (find images matching a text query)
Input: Image file + optional fine-tuned checkpoint
Output: 768-dimensional embedding vector in the CLIP feature space
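A minimal sketch of the encoding flow, assuming the Hugging Face transformers implementation of CLIP and the public openai/clip-vit-large-patch14 checkpoint (the image path is a placeholder; a fine-tuned checkpoint path could be substituted for the model id):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint; swap in a fine-tuned checkpoint path if you have one.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("query.jpg")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 768)

# L2-normalise so dot products between embeddings are cosine similarities.
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
```

For similarity search, store the normalised vectors and rank candidates by dot product against the query embedding; after normalisation this is equivalent to ranking by cosine similarity.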
Inference Settings
No inference-time settings. CLIP encodes images deterministically.
Note: To compare image embeddings to text queries, use a CLIP text encoder on the query side to get text embeddings in the same space.
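As an illustration of the cross-modal case, a sketch of zero-shot classification using the matching text encoder from the same transformers checkpoint (the image path and prompt strings are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("query.jpg")                      # hypothetical input file
prompts = ["a photo of a cat", "a photo of a dog"]   # hypothetical labels

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity of the image against
# each prompt; softmax turns the scores into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(prompts[probs.argmax().item()])
```

Prompt phrasing matters for zero-shot accuracy; templates such as "a photo of a {label}" typically work better than bare label words.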