Vision Language Models
Models that understand both images and text, enabling image captioning, visual question answering (VQA), and structured document extraction from forms, receipts, and invoices.
Available Models
- BLIP-2 – Image captioning, visual question answering, and image-text retrieval
- LayoutLMv3 – Document understanding combining text, layout, and image for forms, receipts, and invoices