Concept
Diffusion Models
The generative AI technique behind Stable Diffusion and DALL-E 3, which creates images by learning to reverse a process of gradually adding noise, turning pure static back into coherent pictures.
Added May 18, 2026
Diffusion models are the technology that turned AI image generation from a curiosity into a cultural phenomenon. Stable Diffusion, DALL-E 3, and Adobe Firefly are built on this approach, and Midjourney is widely understood to be as well, although its architecture has not been published in detail. Understanding how they work explains both why they produce such striking results and why they sometimes produce strange artefacts.
The training process starts with a simple, destructive operation: take a real image and progressively add random noise to it, one step at a time. After enough steps, the image looks like pure static - indistinguishable from random noise. This forward process is fixed and mathematical, not learned. What gets learned is the reverse: given a slightly noisy image, predict the noise that was added so you can subtract it and recover a slightly cleaner version.
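The forward process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular product's code: the linear noise schedule, the 1,000-step count, the function names `make_schedule` and `add_noise`, and the random 8x8 array standing in for a real image are all assumptions made for the example. It relies on the fact that Gaussian noise added over many small steps can be applied in one jump.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bars[t] is the fraction of the original signal variance left at step t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a real image scaled to [-1, 1]
betas, alpha_bars = make_schedule()

slightly_noisy, _ = add_noise(image, 10, alpha_bars, rng)   # early step: image mostly intact
almost_static, _ = add_noise(image, 999, alpha_bars, rng)   # last step: almost pure noise
```

Nothing here is learned: the schedule and the noise are fixed ahead of time, which is what makes the forward direction cheap.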
A neural network - typically a U-Net or a transformer - is trained on millions of examples of this denoising task. Given a noisy version of an image at some step in the corruption process, the network must predict either the noise that was added or the original clean image. By training on images at all noise levels, from slightly corrupted to pure static, the model learns a general sense of what images look like and what distinguishes coherent images from noise.
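A single training example for the noise-prediction variant of this task can be sketched as follows. The sketch assumes a linear noise schedule and a mean-squared-error loss, which are common choices but not the only ones. The names `training_loss`, `zero_model`, and `oracle_model` are hypothetical, and the two "models" are hand-written stand-ins for a neural network, included only to show what a useless and a perfect prediction look like.

```python
import numpy as np

def training_loss(model, x0, alpha_bars, rng):
    """Build one denoising example and score the model's noise prediction."""
    t = rng.integers(0, len(alpha_bars))           # pick a random noise level
    noise = rng.standard_normal(x0.shape)          # the noise the model must recover
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    predicted = model(x_t, t)
    return np.mean((predicted - noise) ** 2)       # mean squared error

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))           # stand-in for a training image

def zero_model(x_t, t):
    # Placeholder for an untrained network: always predicts "no noise".
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    # Cheats by using the known clean image x0; a real network never sees it.
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

loss_untrained = training_loss(zero_model, x0, alpha_bars, rng)
loss_oracle = training_loss(oracle_model, x0, alpha_bars, rng)
```

In real training, a gradient step on this loss would be repeated over many images and noise levels. The oracle's loss is essentially zero only because it is handed the answer.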
At generation time, the process runs in reverse. Start from pure random noise - just static. Ask the model to predict what noise was added and subtract it. Repeat many times, each step producing a slightly cleaner, more coherent image. After 20 to 1,000 steps (depending on the sampler and quality settings), a fully formed image emerges from the initial noise.
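The generation loop can be sketched as below, using the update rule from the original DDPM formulation as one possible sampler; faster samplers take fewer, larger steps. Everything named here is an assumption for illustration: the linear schedule, the function `sample`, and especially `predict_noise`, which is a hand-written oracle that already knows the target image and stands in for a trained network.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Start from static and repeatedly remove predicted noise (DDPM-style update)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # pure random noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        # Subtract this step's share of the predicted noise, then rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a little fresh noise.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

betas = np.linspace(1e-4, 0.02, 1000)              # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
target = np.linspace(-1.0, 1.0, 16).reshape(4, 4)  # toy "image" the oracle knows

def predict_noise(x, t):
    # Oracle stand-in for a trained network: computes the noise relative to target.
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
result = sample(predict_noise, target.shape, betas, rng)
```

With this oracle the loop lands back on `target`, which shows the mechanics of the loop rather than anything about image quality. A trained network replaces the oracle with a learned guess about what a plausible image would be.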
Text conditioning is added by training the model to denoise images while also receiving an encoding of a matching text description. The model learns to steer each denoising step toward images that fit the text. The result is text-to-image generation: start from noise, guide the denoising process with a text prompt, and generate an image that matches the description.
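One widely used way to strengthen this guidance at generation time is classifier-free guidance, a technique the explanation above does not name. The sketch assumes the model has produced two noise predictions for the same noisy image, one with the prompt and one without. The function name `guided_noise` and the toy arrays are hypothetical.

```python
import numpy as np

def guided_noise(eps_uncond, eps_text, guidance_scale):
    """Push the noise estimate away from the unconditional prediction, toward the text one."""
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction made without the prompt
eps_text = np.array([1.0, 1.0])     # toy prediction made with the prompt

ignore_prompt = guided_noise(eps_uncond, eps_text, 0.0)   # equals eps_uncond
plain_prompt = guided_noise(eps_uncond, eps_text, 1.0)    # equals eps_text
strong_prompt = guided_noise(eps_uncond, eps_text, 7.5)   # exaggerates the difference
```

A scale above 1 exaggerates whatever the prompt changed in the prediction, which tends to make outputs follow the text more closely at some cost to variety.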
The quality of modern diffusion model outputs reflects the enormous scale of training data and the sophistication of the architectures involved. Latent diffusion models, which do the diffusion in a compressed latent space rather than pixel space, dramatically reduce computational cost and are what makes Stable Diffusion fast enough to run locally.
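A rough sense of the saving comes from counting the numbers the denoiser has to process per image. The figures below assume the shapes used by Stable Diffusion v1 (a 512x512 RGB image compressed by a factor of 8 per side into a 64x64 latent with 4 channels); other models use different sizes. The helper name `values_per_image` is hypothetical.

```python
def values_per_image(height, width, channels):
    """Count the numbers the denoiser must handle for one image."""
    return height * width * channels

pixel_values = values_per_image(512, 512, 3)    # 786,432 values in pixel space
latent_values = values_per_image(64, 64, 4)     # 16,384 values in latent space
reduction = pixel_values / latent_values        # 48.0
```

Every denoising step therefore works on about 48 times fewer values. The actual speed-up depends on the architecture, and the autoencoder that converts between pixels and latents adds a small fixed cost.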
Analogy
Sculpting by subtraction: starting from a rough block of marble (noise) and repeatedly removing material until the finished sculpture (image) emerges. The sculptor knows what needs to be removed at each step because they understand what good sculpture looks like. The diffusion model knows what noise to subtract because it has learned what real images look like.
Real-world example
When you type a prompt into Midjourney and watch the image sharpen from a blurry haze over several seconds, you are watching iterative denoising in action. Generation starts from random noise, and each preview shows the image after further denoising guided by your text prompt. The previews are snapshots: the interface typically shows only some of the denoising steps the sampler actually runs.
Why it matters
Diffusion models represent a step-change in the quality of AI-generated imagery. The images they produce are often no longer obviously machine-made: for many subjects they are hard to distinguish from professional photography or illustration. This creates enormous opportunities for creative production and equally enormous questions about authenticity, copyright, and misinformation. Understanding the mechanism is essential for evaluating both the potential and the risks.