Object Detection

Computer vision technology that identifies what objects are in an image and precisely locates each one using bounding boxes - the foundation of visual AI applications.

Added May 18, 2026 · 2 min read

Object detection is the task of simultaneously answering two questions about an image: what objects are present, and where are they? The output is a set of bounding boxes - rectangular regions of the image - each labelled with a category and a confidence score. A single image might produce detections for people, cars, traffic lights, and other objects, each with its own box and score.
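The output format described above can be sketched as a small data structure. This is an illustrative sketch, not any particular library's API: the `Detection` class, the `filter_by_score` helper, the corner-based box layout, and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure for illustration)."""
    label: str                               # category name, e.g. "person"
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def filter_by_score(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.score >= threshold]


# Made-up detections for a single street-scene image.
detections = [
    Detection("person", 0.92, (34.0, 50.0, 120.0, 310.0)),
    Detection("car", 0.81, (200.0, 140.0, 480.0, 300.0)),
    Detection("traffic light", 0.35, (510.0, 20.0, 530.0, 70.0)),
]

kept = filter_by_score(detections, threshold=0.5)
print([d.label for d in kept])
```

Thresholding on the confidence score is a common first step when consuming detector output; the right threshold depends on whether missed objects or false alarms are more costly in the application.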

The history of object detection reflects the broader history of deep learning. Early deep learning detectors like R-CNN (2014) used a two-stage approach: first propose regions that might contain objects, then classify each region. This produced accurate results but was slow. Fast R-CNN and Faster R-CNN improved speed through shared convolutional features and a region proposal network, but two-stage approaches remained somewhat slow for real-time applications.

YOLO (You Only Look Once), introduced in 2015, demonstrated that detection could be done in a single forward pass: divide the image into a grid, have each grid cell predict bounding boxes and class probabilities simultaneously, and read out all detections at once. YOLO was dramatically faster than two-stage methods and enabled real-time detection on video. The YOLO family (including YOLOv5, YOLOv8, and YOLO11) has continued to improve and remains a dominant choice for real-time applications.
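The grid-based readout can be sketched as follows. This is a simplified illustration rather than the actual YOLO implementation: it assumes one box per cell, box centres given as offsets within the cell, widths and heights given as fractions of the whole image, and a score formed by multiplying an objectness value by the best class probability. The function name `decode_grid` and the input layout are hypothetical.

```python
def decode_grid(predictions, image_w, image_h, class_names, score_threshold=0.5):
    """Turn a grid of per-cell predictions into a list of detections.

    predictions[row][col] = (x, y, w, h, objectness, class_probs), where
    x, y are offsets inside the cell (0..1) and w, h are fractions of the image.
    Returns a list of (label, score, (x_min, y_min, x_max, y_max)) tuples.
    """
    grid_rows = len(predictions)
    grid_cols = len(predictions[0])
    detections = []
    for row in range(grid_rows):
        for col in range(grid_cols):
            x, y, w, h, objectness, class_probs = predictions[row][col]
            # Pick the most likely class for this cell.
            best = max(range(len(class_probs)), key=class_probs.__getitem__)
            score = objectness * class_probs[best]
            if score < score_threshold:
                continue
            # Convert the cell-relative centre to absolute pixel coordinates.
            cx = (col + x) / grid_cols * image_w
            cy = (row + y) / grid_rows * image_h
            bw, bh = w * image_w, h * image_h
            box = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
            detections.append((class_names[best], score, box))
    return detections


# A toy 2x2 grid over a 100x100 image; only the top-right cell sees an object.
class_names = ["person", "car"]
empty = (0.0, 0.0, 0.0, 0.0, 0.0, [0.5, 0.5])
predictions = [
    [empty, (0.5, 0.5, 0.5, 0.25, 0.9, [0.2, 0.8])],
    [empty, empty],
]
results = decode_grid(predictions, image_w=100, image_h=100, class_names=class_names)
for label, score, box in results:
    print(label, round(score, 2), box)
```

In a real detector the per-cell values come from a single forward pass of the network, which is why every detection is available at once.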

Transformer-based detectors emerged in 2020. DETR (Detection Transformer) replaced the hand-crafted components of earlier detectors with a fully end-to-end transformer architecture: an image is encoded, a fixed set of learnable object queries attend to the encoded image, and each query predicts one object's class and bounding box. DETR eliminated the need for hand-designed anchor boxes and non-maximum suppression post-processing, producing a cleaner and more flexible architecture.
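For context on the post-processing step that DETR removes, here is a minimal sketch of greedy non-maximum suppression (NMS), the procedure earlier detectors typically use to discard duplicate boxes for the same object. The function names `iou` and `nms`, the `(score, box)` input layout, and the 0.5 overlap threshold are assumptions for this example; production systems usually apply NMS per class and rely on an optimised library routine.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs.

    Visits boxes from highest to lowest score and keeps a box only if it does
    not overlap an already-kept box by more than `iou_threshold`.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Two heavily overlapping boxes on one object, plus a separate object.
candidates = [
    (0.8, (1.0, 1.0, 11.0, 11.0)),
    (0.9, (0.0, 0.0, 10.0, 10.0)),
    (0.7, (20.0, 20.0, 30.0, 30.0)),
]
kept = nms(candidates, iou_threshold=0.5)
print([score for score, _ in kept])
```

Because each DETR object query is trained to predict at most one distinct object, the model is meant to emit non-duplicate boxes directly, so a filtering pass like this is not required.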

Practical applications are pervasive: security camera monitoring, retail analytics (counting customers, detecting shoplifting), autonomous vehicle perception (detecting other vehicles, cyclists, pedestrians), medical imaging (detecting tumours, lesions, fractures), manufacturing quality control (detecting defects), and sports analytics (tracking players and equipment).

Analogy

Picture a security guard monitoring a crowded space who, at a glance, identifies every person present and roughly where each one is standing, and who can direct you to person 23 (blue jacket, near the left exit). Object detection gives AI the same rapid, comprehensive spatial inventory of a visual scene.

Real-world example

Retail stores have deployed object detection to enable frictionless payment. Amazon Go stores use ceiling-mounted cameras running object detection models to track which items each customer picks up, charge their account automatically, and let them walk out without a traditional checkout. The system tracks items and people simultaneously across hundreds of cameras.

Why it matters

Object detection is foundational to any AI system that needs to understand visual scenes rather than just classify single images. Autonomous vehicles, robotic manipulation, surveillance, medical diagnosis, and countless other applications require knowing what is present and where. The improvements in detection speed and accuracy over the past decade have been a prerequisite for making visual AI practical in real-time applications.

Related concepts

Image Segmentation

Computer vision technology that labels every pixel in an image according to what it belongs to - enabling AI to precisely identify and delineate objects, not just detect them.

Pose Estimation

Computer vision technology that detects the position of a person's body joints - enabling AI to understand human posture, movement, and gesture from video or images.

Vision-Language Models (VLM)

AI systems that understand both images and text together - reading pictures, answering questions about them, describing scenes, and reasoning across visual and linguistic content in a single model.
