Image Segmentation

Computer vision technology that labels every pixel in an image according to what it belongs to - enabling AI to precisely identify and delineate objects, not just detect them.

Added May 18, 2026 · 3 min read

Object detection tells you what is in an image and roughly where it is, using bounding boxes. Image segmentation tells you what is in an image at the pixel level - exactly which pixels belong to each object or region. This distinction matters enormously in applications where precise boundaries are required.
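One way to see the gap between the two outputs is to measure how much of a bounding box its object actually fills. The sketch below is illustrative only: the mask is a hypothetical hand-written 5x5 grid, and `bounding_box` and `box_fill_ratio` are made-up helper names, not part of any library.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero pixels in a 2-D mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def box_fill_ratio(mask):
    """Fraction of the bounding box's pixels that actually belong to the object."""
    top, left, bottom, right = bounding_box(mask)
    box_area = (bottom - top + 1) * (right - left + 1)
    object_area = sum(sum(row) for row in mask)
    return object_area / box_area


# Hypothetical mask of a thin diagonal object (1 = object pixel, 0 = background).
diagonal_mask = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

print(bounding_box(diagonal_mask))    # (0, 0, 4, 4)
print(box_fill_ratio(diagonal_mask))  # 0.2
```

Here the box covers 25 pixels but only 5 of them belong to the object, so a system that relies on the box alone would treat 80% background as if it were object.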

There are two main variants. Semantic segmentation assigns a class label to every pixel: road, car, pedestrian, building, sky, vegetation, and so on. If there are two cars in the scene, semantic segmentation labels all of their pixels as "car" but does not distinguish between them. Instance segmentation goes further: it identifies each individual object as a separate instance, so the two cars are distinguished as car 1 and car 2, each with its own precise pixel mask.
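The difference between the two outputs can be sketched with a small label map. The example below turns a semantic map into an instance map by connected-component labelling. This is a classical stand-in chosen for illustration, not how learned instance segmentation models work, and it would merge two cars whose pixels touch. The class ids, the map, and the `split_into_instances` helper are all hypothetical.

```python
from collections import deque


def split_into_instances(semantic, target_class):
    """Give each 4-connected blob of `target_class` pixels its own id (1, 2, ...).

    Pixels of every other class get id 0.
    """
    height, width = len(semantic), len(semantic[0])
    instances = [[0] * width for _ in range(height)]
    next_id = 0
    for r in range(height):
        for c in range(width):
            if semantic[r][c] != target_class or instances[r][c]:
                continue
            # Unvisited pixel of the target class: start a new instance and flood-fill it.
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and semantic[ny][nx] == target_class
                            and not instances[ny][nx]):
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id


ROAD, CAR = 0, 1  # hypothetical class ids
semantic_map = [
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
    [CAR,  CAR,  ROAD, ROAD, CAR,  CAR],
    [CAR,  CAR,  ROAD, ROAD, CAR,  CAR],
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
]

instance_map, car_count = split_into_instances(semantic_map, CAR)
print(car_count)        # 2
print(instance_map[1])  # [1, 1, 0, 0, 2, 2]
```

The semantic map only says "car" for both blobs. The instance map keeps the same pixels but gives each car its own id.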

The architectural evolution of segmentation networks mirrors the broader history of computer vision. Early approaches used fully convolutional networks - standard convolutional image classifiers modified to output per-pixel predictions rather than a single set of class scores for the whole image. Skip connections, which re-introduce high-resolution spatial information from early layers into the later prediction, were crucial for producing clean boundaries rather than blurry region approximations. The U-Net architecture (encoder-decoder with skip connections), first introduced for biomedical images, became a standard choice for semantic segmentation.
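Why skip connections help with boundaries can be shown with a toy 1-D example. This is not a neural network: averaging stands in for the encoder's downsampling, value repetition for the decoder's upsampling, and fixed 0.5/0.5 weights for what would be a learned convolution over the concatenated features. The input also doubles as the ideal mask, so the sketch only shows where the lost spatial detail comes back from. All names are hypothetical.

```python
def avg_pool(signal):
    """Encoder step: halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Decoder step: double the resolution by repeating each value."""
    return [value for value in signal for _ in range(2)]


def threshold(signal):
    """Turn per-position scores into a binary mask."""
    return [1 if value > 0.5 else 0 for value in signal]


# Hypothetical 1-D "image": the object starts at index 3,
# which does not line up with the pooling pairs (2, 3).
high_res = [0, 0, 0, 1, 1, 1, 1, 1]

# Encoder-decoder without a skip connection: the edge position is smeared.
decoded = upsample(avg_pool(high_res))

# Skip connection: pair the decoded features with the high-resolution
# encoder features (two channels per position), then mix them with fixed
# weights standing in for a learned 1x1 convolution.
stacked = list(zip(decoded, high_res))
with_skip = [0.5 * d + 0.5 * h for d, h in stacked]

print(threshold(decoded))    # [0, 0, 0, 0, 1, 1, 1, 1]
print(threshold(with_skip))  # [0, 0, 0, 1, 1, 1, 1, 1]
```

Without the skip path the boundary lands one position off, because pooling discarded the information about where inside the pair the edge fell. With it, the high-resolution features restore the exact edge.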

The Segment Anything Model (SAM), released by Meta in 2023, represented a paradigm shift. Rather than training a model to segment specific categories, SAM was trained to segment anything - to produce precise masks for any region in an image that a user specifies through prompts such as points, bounding boxes, or rough masks (text prompts were explored in the research but were not part of the released model). SAM generalised across a remarkable range of domains - medical images, satellite photos, artwork, microscopy - largely because its training dataset, SA-1B, was enormous and diverse, with over a billion masks across 11 million images. SAM 2, released in 2024, extended this to video.

Practical applications of segmentation include autonomous vehicle scene understanding (precisely where the road boundaries, pedestrians, and other vehicles are), medical image analysis (precisely which pixels belong to a tumour), satellite imagery analysis (precise building footprints and vegetation boundaries), background removal in photography, and augmented reality (precisely isolating the foreground subject so it can be composited onto a different background).
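Once a mask exists, background replacement reduces to a per-pixel blend. The sketch below assumes the mask has already been produced by some segmentation model. The tiny images, the mask values, and the `composite` helper are hypothetical and only show how the mask is used.

```python
def composite(foreground, background, mask):
    """Keep foreground pixels where the mask is 1 and background pixels where it is 0.

    Fractional mask values (soft edges) blend the two images linearly.
    Images are rows of (R, G, B) tuples; the mask is rows of numbers in [0, 1].
    """
    return [
        [
            tuple(round(m * f + (1 - m) * b) for f, b in zip(fg_px, bg_px))
            for fg_px, bg_px, m in zip(fg_row, bg_row, mask_row)
        ]
        for fg_row, bg_row, mask_row in zip(foreground, background, mask)
    ]


RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)

# Hypothetical 2x3 photo: a red subject on the left, green background on the right.
photo = [[RED, RED, GREEN],
         [RED, RED, GREEN]]
new_backdrop = [[BLUE, BLUE, BLUE],
                [BLUE, BLUE, BLUE]]
# 1 = subject, 0 = background, 0.2 = a soft edge pixel that is mostly background.
subject_mask = [[1, 1, 0],
                [1, 0.2, 0]]

result = composite(photo, new_backdrop, subject_mask)
print(result[0])  # [(255, 0, 0), (255, 0, 0), (0, 0, 255)]
print(result[1])  # [(255, 0, 0), (51, 0, 204), (0, 0, 255)]
```

The quality of the result depends almost entirely on the mask: any pixel mislabelled at the boundary shows up as a halo or a missing sliver of the subject.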

Analogy

Think of the difference between circling objects in a photograph with a pen and carefully cutting each object out with scissors. A bounding box is the circle - approximate, fast. Segmentation is the careful cutting - precise, detailed, showing exactly which part of the image each object occupies.

Real-world example

Adobe Photoshop's 'Select Subject' and 'Remove Background' features use segmentation models to identify which pixels belong to the main subject and which belong to the background. Background removal that once took hours of careful manual masking now happens in seconds, because a segmentation model identifies the subject's boundary at the pixel level.

Why it matters

Precise spatial understanding at the pixel level is what separates AI vision systems that understand a scene from those that merely detect categories. Autonomous vehicles need to know exactly where the road edge is, not approximately. Surgical robots need exact tissue boundaries. Satellite analysis needs precise building footprints. Segmentation is the enabling technology for any vision application where approximate is not good enough.

Related concepts

Object Detection

Computer vision technology that identifies what objects are in an image and locates each one using bounding boxes - the foundation of visual AI applications.

Pose Estimation

Computer vision technology that detects the position of a person's body joints - enabling AI to understand human posture, movement, and gesture from video or images.

Vision-Language Models (VLM)

AI systems that understand both images and text together - reading pictures, answering questions about them, describing scenes, and reasoning across visual and linguistic content in a single model.
