Multimodal
Models that process or generate multiple modalities at once
Multimodal tasks work with multiple types of data at the same time.
Examples: text + images, text + audio, text + video.


Common Multimodal Tasks
- Image-Text-to-Text: Generate text from a combination of images and text prompts
- Visual Question Answering: Answer questions about images
- Document Question Answering: Answer questions from documents or PDFs
- Audio-to-Text: Generate text from audio input, such as a transcript or a summary of speech
- Video-to-Text: Generate text based on video content
- Visual Document Retrieval: Retrieve relevant documents or images for a multimodal query
- Any-to-Any: Convert between arbitrary input and output modalities
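
One way to make the list above concrete is to describe each task by the modalities it consumes and produces. The sketch below is illustrative only: the `MultimodalTask` class, the `tasks_accepting` helper, and the modality labels ("image", "text", and so on) are hypothetical names chosen for this example, and the input/output assignments are one reading of the descriptions above. Visual Document Retrieval and Any-to-Any are left out because the list does not pin their inputs to a fixed set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultimodalTask:
    """A task described by the modalities it consumes and produces."""
    name: str
    inputs: frozenset   # modalities the task needs as input
    outputs: frozenset  # modalities the task generates


# Hypothetical modality labels; assignments are an interpretation of the list.
TASKS = [
    MultimodalTask("Image-Text-to-Text", frozenset({"image", "text"}), frozenset({"text"})),
    MultimodalTask("Visual Question Answering", frozenset({"image", "text"}), frozenset({"text"})),
    MultimodalTask("Document Question Answering", frozenset({"document", "text"}), frozenset({"text"})),
    MultimodalTask("Audio-to-Text", frozenset({"audio"}), frozenset({"text"})),
    MultimodalTask("Video-to-Text", frozenset({"video"}), frozenset({"text"})),
]


def tasks_accepting(available):
    """Return the names of tasks whose required inputs are all available."""
    available = frozenset(available)
    return [task.name for task in TASKS if task.inputs <= available]


if __name__ == "__main__":
    # With an image and a text prompt, only the image + text tasks qualify.
    print(tasks_accepting({"image", "text"}))
```

Treating the inputs as a set keeps the lookup independent of argument order, and adding a new task is a one-line change to `TASKS`.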