Concept
Multimodal AI
An AI system that can work with more than just text - handling images, audio, and video alongside written language, and reasoning across all of them together.
The first wave of widely used AI language models could only work with text. You typed something in and got text back. This was powerful, but it also meant the AI was cut off from most of how information actually exists in the world. Documents contain charts and diagrams. Products have photos. Conversations happen over voice. Interfaces are visual. A text-only AI cannot engage with any of that.
Multimodal AI lifts that restriction. A multimodal model can look at an image and describe what it sees, read text in a photo, interpret a graph, watch a short video clip, or listen to audio - and it can do all of this in the same conversation as regular text, mixing and combining these inputs naturally. You might paste in a screenshot of an error message and ask what is wrong. Or share a photo of a dish and ask for the recipe. Or upload a PDF with charts and ask for a summary of the trends.
Making this work required teaching the model to encode different types of information - pixels in an image, waveforms in audio - into the same kind of numerical representation it uses for text. Once everything is in the same format, the model can reason across them together. This is harder than it sounds and requires specific training on large datasets of matched content: images paired with descriptions, audio paired with transcripts, and so on.
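To make that encoding step concrete, here is a minimal, illustrative sketch in Python (PyTorch): two toy encoders - one for pixels, one for text tokens - project their inputs into the same shared vector space, where they can be compared. The sizes, layers, and pooling choices here are arbitrary stand-ins for illustration, not how any production multimodal model is actually built.

```python
# Illustrative toy sketch only: two tiny encoders that map an "image" and a
# token sequence into the same embedding dimension, so the two modalities can
# be compared (and, in a real model, reasoned over together).
import torch
import torch.nn as nn

EMBED_DIM = 64  # shared embedding size (arbitrary for this example)

class ToyImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # flatten a small 3x32x32 image and project it into the shared space
        self.proj = nn.Linear(3 * 32 * 32, EMBED_DIM)

    def forward(self, pixels):               # pixels: (batch, 3, 32, 32)
        return self.proj(pixels.flatten(1))  # -> (batch, EMBED_DIM)

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED_DIM)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        return self.embed(token_ids).mean(1)  # average pool -> (batch, EMBED_DIM)

image_enc, text_enc = ToyImageEncoder(), ToyTextEncoder()
img_vec = image_enc(torch.rand(1, 3, 32, 32))          # fake image
txt_vec = text_enc(torch.randint(0, 1000, (1, 8)))     # fake token ids

# Once both modalities live in the same vector space, they can be compared.
print(torch.cosine_similarity(img_vec, txt_vec).item())
```

In real systems the encoders are large trained networks and the alignment between modalities comes from the matched-content training data described above, but the core idea is the same: everything ends up as vectors in a space the language model can work with.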
The most immediately practical impact has been on document and image understanding. Models like GPT-4 Vision, Claude 3, and Gemini can read scanned documents, interpret scientific figures, describe photographs, and extract information from tables in images - tasks that text-only AI simply could not do. This opens up enormous amounts of real-world content that was previously inaccessible.
Looking further ahead, multimodality is also what makes AI agents more capable. An agent that can see what is on a computer screen can do things that a text-only agent cannot. Reading a web page as actual rendered visual content, navigating a user interface by looking at it rather than parsing its underlying code - these capabilities depend on multimodal AI.
Analogy
A person who can only read is limited in how they can understand the world. Add the ability to see, hear, and watch, and the range of tasks they can help with expands dramatically. Multimodal AI is the difference between an assistant who can only read memos and one who can also look at a whiteboard, listen to a meeting recording, and review a slideshow.
Real-world example
If you photograph a page of handwritten notes and send it to Claude with the question "what are the key action items here?", the model reads your handwriting, understands the content, and gives you a structured list. This is multimodal AI in a practical form - input that is not text, processed like text.
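As a rough illustration of what that looks like in code, here is a hedged sketch using the Anthropic Python SDK: an image file is base64-encoded and sent alongside a text question in a single message. The model name and file path are placeholders - check the current documentation for the model available to you.

```python
# Sketch of sending an image plus a question via the Anthropic Python SDK.
# The file path and model name are placeholders; adjust for your own setup.
import base64
import anthropic

with open("notes.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text",
             "text": "What are the key action items here?"},
        ],
    }],
)
print(message.content[0].text)
```

The interesting part is the message structure: the image and the question travel together as one user turn, and the model's reply comes back as ordinary text.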
Why it matters
Multimodality is what brings AI closer to the full range of human perception. Most of the information people work with every day is not pure text - it is documents with images, interfaces with visuals, conversations with voice. AI that can handle all of these is qualitatively more useful than AI that handles only one.
In the news
Google's Gemini Omni Showcases Multimodal Video Generation Breakthrough
Digg AI · 5d ago
Amazon's Nova Multimodal Embeddings Transform Manufacturing Intelligence
AWS ML Blog · 5d ago
AI Breakthrough Reduces Hallucinations in Vision-Language Models
arXiv CS.LG · 6d ago
Gemini API File Search Now Multimodal
Hacker News · 6d ago
Google Simplifies Building RAG Systems with Gemini API Update
Analytics Vidhya · 1w ago
Related concepts