Concept
ControlNet
A technique that adds precise structural control to diffusion image generation - letting you specify exactly the composition, pose, or layout of an image through maps, sketches, or depth information.
Added May 18, 2026
Text-to-image diffusion models are powerful but imprecise. You can describe what you want in words, but controlling exactly where elements appear, the pose of a person, the perspective of a scene, or the structure of a composition is extremely difficult through text alone. ControlNet solves this by adding a conditioning pathway that accepts structured spatial information alongside the text prompt.
ControlNet works by adding a trainable copy of the diffusion model's encoder blocks to the architecture. This trainable copy processes the control input - a depth map, edge map, pose skeleton, segmentation mask, or other structured spatial signal - and its outputs are added to the corresponding layers of the main diffusion model through zero-initialised convolution layers. Because those layers start at zero, the ControlNet branch contributes nothing at the start of training and cannot disturb the pretrained model. The main model's weights are frozen, so its general image quality is preserved; the ControlNet branch adds spatial guidance on top.
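The frozen-model-plus-trainable-copy idea can be illustrated with a toy sketch. This is not the real architecture: each "block" is reduced to a single scalar weight, and the names FrozenBlock, ControlBranch, zero_proj, and controlled_forward are hypothetical, chosen only for illustration. It assumes the branch output passes through a zero-initialised projection before being added to the frozen output, as in the original ControlNet design.

```python
class FrozenBlock:
    """Toy stand-in for one pretrained block; its weight is never updated."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return [self.weight * v for v in x]


class ControlBranch:
    """Toy trainable copy of a block with a zero-initialised output projection."""

    def __init__(self, frozen_block):
        self.weight = frozen_block.weight  # start from the pretrained weight
        self.zero_proj = 0.0               # zero at initialisation (assumption)

    def __call__(self, x, control):
        # The copy sees the features plus the control signal.
        hidden = [self.weight * (xv + cv) for xv, cv in zip(x, control)]
        # The projection scales the result before it reaches the main model.
        return [self.zero_proj * h for h in hidden]


def controlled_forward(block, branch, x, control):
    """Frozen output plus the control branch's residual."""
    base = block(x)
    residual = branch(x, control)
    return [b + r for b, r in zip(base, residual)]


block = FrozenBlock(2.0)
branch = ControlBranch(block)
features = [1.0, 2.0, 3.0]
control = [0.5, 0.0, -0.5]

# At initialisation the branch adds nothing, so the output matches the frozen model.
initial_output = controlled_forward(block, branch, features, control)

# Once training moves the projection away from zero, the control signal takes effect.
branch.zero_proj = 0.5
trained_output = controlled_forward(block, branch, features, control)
```

In this sketch only the branch's weight and zero_proj would receive gradient updates; the frozen block is read but never modified, which is why the base model's behaviour is preserved.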
The types of control signals available reflect a wide range of use cases. Edge control, using a Canny edge map extracted from a reference image or a rough hand-drawn line sketch, lets you generate a photorealistic image that follows that composition. Human pose skeletons let you specify the exact posture of a person and generate them in any style. Depth maps let you specify the spatial layout of a scene - which elements are near and far - and generate images that respect that depth structure. Segmentation maps let you specify which regions should be sky, building, road, and so on.
The training approach is elegant: train the ControlNet branch on pairs of images and their corresponding control signals, with the control signals derived automatically from a large image dataset. Depth maps are computed from images using a depth estimation model. Edge maps are computed using standard computer vision algorithms such as the Canny edge detector. Human poses are detected using pose estimation models. Because every control signal is produced by an existing model or algorithm, none of them has to be annotated by hand.
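The automatic derivation of control signals can be sketched in a few lines. The example below is a simplified stand-in, not a real training pipeline: it uses a thresholded Sobel gradient rather than a full Canny detector, works on tiny grayscale images stored as lists of rows, and the names sobel_edge_map and build_training_pairs are hypothetical.

```python
def sobel_edge_map(image, threshold=1.0):
    """Return a binary edge map for a 2D grayscale image (list of rows).

    Border pixels are left at 0 because the 3x3 kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[y][x] = 1 if magnitude >= threshold else 0
    return edges


def build_training_pairs(images, extract_control=sobel_edge_map):
    """Pair each image with a control signal computed from the image itself."""
    return [(image, extract_control(image)) for image in images]


# A 5x5 image that is dark on the left and bright on the right.
step_image = [[0, 0, 0, 1, 1] for _ in range(5)]
pairs = build_training_pairs([step_image])
```

Swapping extract_control for a depth estimator or pose detector would produce pairs for a different control type without changing the rest of the pipeline, which is the point of deriving the signals automatically.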
ControlNet transformed professional use of diffusion models. Design workflows that needed precise spatial control - architectural visualisation, product design, fashion design, game asset creation - became viable with diffusion tools for the first time. A designer could sketch a rough composition and generate photorealistic versions that preserve their structural intent.
Analogy
The difference between describing a building to an architect in words versus handing them a floorplan. The verbal description might get you a building that captures the feeling you want. The floorplan gets you one with exactly the room layout, dimensions, and relationships you need. ControlNet adds the floorplan capability to AI image generation.
Real-world example
Architectural firms have adopted ControlNet for early-stage design visualisation. A designer sketches a building facade, generates a depth map from a reference photo, and uses ControlNet to produce photorealistic renderings in multiple architectural styles and materials that precisely follow their structural composition. What previously required a professional renderer working for hours now takes seconds per variant.
Why it matters
ControlNet made AI image generation a viable professional tool rather than just a novelty. Creative professionals need spatial control - rough ideas expressed as sketches, layouts, or compositions. Without this control, AI generation is too unpredictable for production workflows. ControlNet bridged the gap between the generative power of diffusion models and the precise control needs of professional creative work.