CLIP ViT-L/14
Joint vision-language embedding model for image similarity and zero-shot tasks
CLIP ViT-L/14 from OpenAI encodes images into an embedding space shared with text, enabling zero-shot classification and cross-modal retrieval without labelled data.
When to use:
- Image similarity search (find visually similar images)
- Zero-shot image classification without labelled data
- Cross-modal retrieval (find images matching a text query)
Input: Image file + optional fine-tuned checkpoint
Output: 768-dimensional embedding vector in the CLIP feature space
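A minimal sketch of the encoding flow, assuming the Hugging Face transformers implementation of CLIP and the public openai/clip-vit-large-patch14 checkpoint (the image path is a placeholder; a fine-tuned checkpoint path could be substituted for the model id):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint; swap in a fine-tuned checkpoint path if you have one.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("query.jpg")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 768)

# L2-normalise so dot products between embeddings are cosine similarities.
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
```

For similarity search, store the normalised vectors and rank candidates by dot product against the query embedding; after normalisation this is equivalent to ranking by cosine similarity.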
Inference Settings
No inference-time settings. CLIP encodes images deterministically.
Note: To compare image embeddings to text queries, use a CLIP text encoder on the query side to get text embeddings in the same space.
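As an illustration of the cross-modal case, a sketch of zero-shot classification using the matching text encoder from the same transformers checkpoint (the image path and prompt strings are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("query.jpg")                      # hypothetical input file
prompts = ["a photo of a cat", "a photo of a dog"]   # hypothetical labels

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarity of the image against
# each prompt; softmax turns the scores into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(prompts[probs.argmax().item()])
```

Prompt phrasing matters for zero-shot accuracy; templates such as "a photo of a {label}" typically work better than bare label words.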