Variational Autoencoder (VAE)

A neural network that learns to compress data into a structured latent space and then reconstruct it - the compression engine that makes latent diffusion models fast enough to run locally.

Added May 18, 2026 · 3 min read

VAEs are the foundational architecture for generative modelling with structured latent spaces. Their combination of efficient compression and continuously navigable latent representations made them the standard choice for the encoding components of larger generative systems. In latent diffusion models specifically, VAEs are what make consumer-hardware image generation possible by reducing the computational cost of the diffusion process by orders of magnitude.

An autoencoder is a neural network that learns to compress data into a compact representation and then reconstruct it from that representation. The encoder maps high-dimensional input (an image with millions of pixels) to a low-dimensional representation (a compact vector or tensor). The decoder maps that representation back to the original space. Trained to minimise reconstruction error, the model is forced to learn the most information-preserving compression possible.

A variational autoencoder adds a probabilistic twist. Instead of encoding each input to a single fixed point in latent space, the VAE encodes it as a probability distribution - specifically, a Gaussian distribution defined by a mean and variance for each latent dimension. During training, a sample is drawn from this distribution (using the reparameterisation trick to allow gradients to flow through the sampling step) and passed to the decoder. The model is trained with two objectives: minimise reconstruction error as before, and keep the encoded distributions close to a standard normal distribution (the KL divergence regularisation term).

The KL regularisation is what makes the VAE's latent space useful for generation. By keeping all encoded distributions close to a standard normal, the model ensures that the latent space is continuously occupied - that you can sample a random point from the standard normal, decode it, and get something that looks like plausible data. In a regular autoencoder, the latent space may have large gaps where decoding produces garbage. The VAE's regularisation fills those gaps with interpolatable content.

For image generation, VAEs serve a critical compression role in latent diffusion models like Stable Diffusion. The VAE compresses images from, say, 512 x 512 x 3 = 786,432 pixel values down to 64 x 64 x 4 = 16,384 latent values - a 48-fold compression. The diffusion process operates in this compact latent space, then the VAE decoder maps the result back to pixel space. This is what makes local image generation fast: doing diffusion on 16,384 values is far more tractable than doing it on 786,432.

VAEs have also been applied in drug discovery, where the latent space structure allows interpolation between known molecules to explore novel chemical structures. In music generation, they encode audio into compact representations that capture style and structure. The controlled sampling and interpolation properties of VAE latent spaces make them useful anywhere you want to explore a data distribution smoothly.

Analogy

A very efficient filing system that organises documents not by their physical form but by their meaning, and ensures that similar documents are stored near each other. You can retrieve any document by its location, and by navigating between locations you can find documents with intermediate characteristics. The VAE maps data into a space with this property - structured, continuous, navigable.

Real-world example

Interpolating between two faces in the latent space of a face-VAE produces a smooth morph rather than a jarring transition. At the midpoint of the latent interpolation, you see a face that blends features from both sources. This smooth interpolation property only works because the VAE regularisation ensures the latent space is densely and continuously populated with valid face representations.

Why it matters

VAEs are the foundational architecture for generative modelling with structured latent spaces. Their combination of efficient compression and continuously navigable latent representations made them the standard choice for the encoding components of larger generative systems. In latent diffusion models specifically, VAEs are what make consumer-hardware image generation possible by reducing the computational cost of the diffusion process by orders of magnitude.

In the news

No recent coverage - search for Variational Autoencoder (VAE).

Related concepts

Diffusion Models

The generative AI technique behind Stable Diffusion and DALL-E 3 - which creates images by learning to reverse a process of gradually adding noise, turning pure static back into coherent pictures.

Latent Space

The internal numerical world where an AI model represents meaning - a high-dimensional space where similar concepts cluster together and mathematical operations on numbers produce meaningful semantic results.

Stable Diffusion

The open-source text-to-image model that made high-quality AI image generation widely accessible - running on consumer hardware and spawning an entire ecosystem of tools and applications.

← Back to concepts