LLaVA-Next 13B Embeddings
Joint image-text embeddings from LLaVA-Next for multimodal retrieval
LLaVA-Next generates a joint embedding from an image and an optional text prompt. The 13B-parameter model produces a single 4096-dimensional vector that combines visual and language understanding, suitable for retrieval and similarity search.
When to use:
- Cross-modal retrieval (search images using text and vice versa)
- Building multimodal search indexes
- Semantic image similarity with text context
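The retrieval cases above reduce to nearest-neighbor search over embedding vectors. A minimal sketch, using random placeholder vectors in place of real LLaVA-Next outputs and cosine similarity for ranking (the query is nudged toward one index row so the example has a known nearest neighbor):

```python
import numpy as np

DIM = 4096  # embedding size stated in this doc

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k index rows most similar to the query."""
    # L2-normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(scores)[::-1][:k]

# Placeholder vectors standing in for image embeddings from the model.
rng = np.random.default_rng(0)
image_index = rng.normal(size=(100, DIM))

# A "text query" embedding, nudged toward row 42 so the expected
# nearest neighbor is known in advance.
text_query = image_index[42] + 0.01 * rng.normal(size=DIM)

print(cosine_top_k(text_query, image_index, k=3)[0])  # row 42 ranks first
```

In practice the index rows would be embeddings of your images and the query an embedding of the search text (or another image); because both live in the same joint space, the same similarity search works in either direction.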
Input:
- Image (required): Image to encode
- Text (optional): Text prompt to pair with the image for joint embedding
Output: 4096-dimensional joint embedding vector
Inference Settings
The model exposes no inference-time settings; embeddings are computed deterministically, so identical inputs always yield identical vectors.
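Because the output is deterministic, embeddings can be cached and reused instead of recomputed. A sketch of a cache wrapper, where `embed_fn` is a hypothetical stand-in for the actual model call (demonstrated here with a counting stub, not the real model):

```python
import hashlib
from typing import Optional

def make_cached_embedder(embed_fn):
    """Wrap an embedding call with a cache keyed on the exact inputs.

    Safe only because the model is deterministic: identical inputs
    always produce identical vectors.
    """
    cache = {}

    def embed(image_bytes: bytes, text: Optional[str] = None):
        key = (hashlib.sha256(image_bytes).hexdigest(), text)
        if key not in cache:
            cache[key] = embed_fn(image_bytes, text)  # hypothetical model call
        return cache[key]

    return embed

# Demo with a stub that counts invocations (stands in for the real model).
calls = []
def stub_embed(image_bytes, text):
    calls.append(1)
    return [float(len(image_bytes))] * 4  # placeholder, not a real embedding

embed = make_cached_embedder(stub_embed)
embed(b"pixels", "a red bicycle")
embed(b"pixels", "a red bicycle")  # second call is served from the cache
print(len(calls))  # stub was invoked once
```

Hashing the image bytes keeps cache keys small and hashable; the optional text is part of the key because it changes the joint embedding.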