latentbrief

Concept

Edge AI

Running AI inference directly on edge devices - smartphones, IoT sensors, cameras, and embedded systems - rather than in the cloud, enabling real-time responsiveness, offline capability, and privacy by keeping data local.

Added May 18, 2026

Most AI systems assume that data is sent to a cloud server for inference: your photo goes to Google's servers, voice audio goes to Amazon's servers, medical sensor data goes to a hospital's cloud. Edge AI inverts this: inference runs on the device itself, whether that is a phone, a surveillance camera, an industrial sensor, a car's onboard computer, or a microcontroller in a wearable.

The case for edge inference rests on four arguments. Latency: a round trip to the cloud typically takes 50-200ms under good network conditions, while autonomous vehicle perception, industrial robot control, and augmented reality applications often need results in under 10ms, a budget that the network hop alone exceeds. Bandwidth: video and sensor streams are large; sending them continuously to the cloud is expensive and sometimes impossible (rural IoT deployments, spacecraft, submarines). Privacy: many users and regulators are uncomfortable with audio, video, and biometric data leaving the device. Reliability: cloud-dependent systems fail when connectivity is lost; edge inference continues offline.
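The latency and bandwidth arguments can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inference times, the 50ms round trip (the low end of the range above), the 10ms deadline, and the uncompressed 1080p camera are assumed figures, not measurements.

```python
# Back-of-the-envelope checks for the latency and bandwidth arguments.
# All concrete numbers are illustrative assumptions, not measurements.


def meets_deadline(inference_ms: float, network_rtt_ms: float, deadline_ms: float) -> bool:
    """Return True if compute time plus network round trip fits the deadline."""
    return inference_ms + network_rtt_ms <= deadline_ms


def raw_video_mbit_per_s(width: int, height: int, fps: int, bytes_per_pixel: int = 3) -> float:
    """Uncompressed video bitrate in megabits per second (1 Mbit = 1e6 bits)."""
    return width * height * bytes_per_pixel * fps * 8 / 1e6


# Cloud path: assumed 5 ms of server inference plus a 50 ms round trip.
cloud_ok = meets_deadline(inference_ms=5, network_rtt_ms=50, deadline_ms=10)

# Edge path: assumed 8 ms of on-device inference and no network hop.
edge_ok = meets_deadline(inference_ms=8, network_rtt_ms=0, deadline_ms=10)

# One uncompressed 1080p camera at 30 frames per second, 3 bytes per pixel.
camera_mbit = raw_video_mbit_per_s(1920, 1080, 30)

print(cloud_ok, edge_ok, round(camera_mbit))
```

Even with a faster server, the cloud path misses the deadline because the network hop alone is longer than the budget. Real deployments compress video before sending it, so the raw bitrate is an upper bound, but it shows why continuous streaming from many cameras adds up quickly.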

Deploying models on edge devices requires overcoming severe resource constraints. Even a flagship smartphone has only 8-12GB of RAM, shared with the operating system and other apps, and a mobile GPU a fraction as powerful as a data centre GPU. An IoT microcontroller might have 256KB of RAM and no GPU at all. Models designed for cloud inference typically require gigabytes of memory and billions of compute operations per inference - far beyond what edge hardware can accommodate.
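A rough way to see the gap is to multiply parameter count by bytes per parameter. The sketch below counts weight storage only, ignoring activations and runtime overhead, and treats all of it as RAM, which is a simplification because microcontrollers often keep weights in flash. The two model sizes and the helper names are hypothetical, chosen for illustration.

```python
# Rough weight-storage estimate: parameters x bytes per parameter.
# Ignores activations, runtime overhead, and the operating system.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return int(num_params * BYTES_PER_PARAM[dtype])


def fits(num_params: int, dtype: str, ram_bytes: int) -> bool:
    """True if the weights alone fit in the given amount of memory."""
    return weight_bytes(num_params, dtype) <= ram_bytes


PHONE_RAM = 8 * 1024**3   # 8GB flagship phone (figure from the text)
MCU_RAM = 256 * 1024      # 256KB microcontroller (figure from the text)

big_model = 7_000_000_000  # hypothetical 7-billion-parameter model
tiny_model = 200_000       # hypothetical 200k-parameter keyword spotter

for name, params, ram in [("big/phone", big_model, PHONE_RAM),
                          ("tiny/mcu", tiny_model, MCU_RAM)]:
    for dtype in BYTES_PER_PARAM:
        size_mb = weight_bytes(params, dtype) / 1e6
        print(f"{name:10s} {dtype:5s} {size_mb:12.2f} MB  fits={fits(params, dtype, ram)}")
```

Under these assumptions the large model fits on the phone only at 4-bit precision, and the small model fits in the microcontroller's memory only once it is reduced to 8 bits or fewer, which is why compression matters at both ends of the range.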

Model compression techniques are the toolbox for edge deployment. Quantisation (reducing weights, and often activations, from FP32 to INT8 or even 4-bit) shrinks the model and enables integer arithmetic that runs efficiently on edge hardware. Pruning removes low-importance weights, reducing computation. Knowledge distillation trains small student models that mimic large teacher models, producing compact models optimised for the target task. Architecture design matters as well: MobileNet and ShuffleNet were built specifically for mobile inference, and neural architecture search can discover architectures that are efficient on the target hardware (MobileNetV3 and EfficientNet were both found with the help of such a search).
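Two of these techniques are easy to sketch in a few lines. The example below is a minimal illustration, assuming symmetric per-tensor INT8 quantisation and unstructured magnitude pruning on a flat list of weights. The function names and weight values are hypothetical, and production toolchains typically add per-channel scales, calibration data, and fine-tuning after pruning.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical weights for illustration.
weights = [0.82, -0.03, 0.40, -1.27, 0.01, 0.66]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

pruned = magnitude_prune(weights, sparsity=0.5)

print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_error)
print("pruned:", pruned)
```

Each quantised weight takes one byte instead of four, and the round-trip error is bounded by half the scale. Pruning here only zeroes values; the saving in computation appears when the runtime or hardware can skip zeros or when whole channels are removed.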

Edge ML frameworks support deployment across diverse hardware. TensorFlow Lite (mobile and embedded; renamed LiteRT in 2024), ONNX Runtime (cross-platform), Core ML (Apple devices), and SNPE (Snapdragon Neural Processing Engine, for Qualcomm chips) each optimise inference for their hardware targets. Neural Processing Units (NPUs) - dedicated silicon for ML operations - are now standard in flagship smartphones (Apple Neural Engine, Google Tensor chip, Qualcomm Hexagon), dramatically improving edge inference performance per watt.

TinyML is the extreme end of edge AI: running inference on microcontrollers with kilobytes of memory. TensorFlow Lite Micro and Edge Impulse enable voice keyword detection, gesture recognition, and anomaly detection on hardware like Arduino, STM32, and Nordic nRF chips - enabling ML capabilities in devices that cost a few dollars and run on coin cell batteries for months.
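The "months on a coin cell" claim follows from duty cycling: the chip sleeps almost all the time and wakes briefly to run inference. The sketch below works through that arithmetic with assumed figures: a coin cell of roughly 220mAh, 5mA while active, 5µA while asleep, and the chip awake 1% of the time. These are illustrative values rather than datasheet numbers for any specific part, and the function names are hypothetical.

```python
def average_current_ma(active_ma: float, sleep_ma: float, duty_cycle: float) -> float:
    """Time-weighted average current for a device awake `duty_cycle` of the time."""
    return active_ma * duty_cycle + sleep_ma * (1 - duty_cycle)


def battery_life_days(capacity_mah: float, avg_ma: float) -> float:
    """Idealised battery life, ignoring self-discharge and voltage droop."""
    return capacity_mah / avg_ma / 24


# Assumed figures: 5 mA active, 5 uA asleep, awake 1% of the time.
avg = average_current_ma(active_ma=5.0, sleep_ma=0.005, duty_cycle=0.01)

# Assumed coin cell capacity of about 220 mAh.
days = battery_life_days(capacity_mah=220, avg_ma=avg)

print(f"average current: {avg:.5f} mA, estimated life: {days:.0f} days")
```

Under these assumptions the device lasts a little over five months. The estimate is dominated by the active current and duty cycle, so a faster or smaller model that shortens each wake-up extends battery life almost proportionally.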

Analogy

Think of the difference between a centralised library and a personal bookshelf. A library has every book and can answer any question, but getting an answer requires a trip there. A personal bookshelf holds only the books you use most often, but answers common questions instantly without leaving home. Edge AI builds the personal bookshelf: a carefully curated, efficiently represented model that handles the most important inference tasks locally, without the round-trip cost and connectivity requirement of the centralised cloud library.

Real-world example

Apple's Face ID runs entirely on the iPhone, using the Neural Engine in the Apple A-series chip with the face data protected by the Secure Enclave. On each unlock, a neural network maps the depth map and infrared image of the face to a mathematical representation and compares it against the enrolled face template. The entire computation takes under 100ms, uses no internet connectivity (the biometric data never leaves the device), and works in complete darkness. Deploying this capability in the cloud would introduce unacceptable latency, require constant connectivity, and create a centralised database of facial biometrics - all problems that edge inference eliminates.
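The "map to a representation, then compare against a template" pattern can be sketched generically. The example below is not Apple's algorithm, which is not public in this detail. It assumes a common approach in which a network outputs an embedding vector and a match is declared when the cosine similarity to the enrolled embedding exceeds a threshold. The vectors, the threshold of 0.9, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def matches(embedding, template, threshold=0.9):
    """Declare a match when the embedding is close enough to the enrolled template."""
    return cosine_similarity(embedding, template) >= threshold


# Hypothetical 4-dimensional embeddings; real systems use far more dimensions.
enrolled = [0.12, 0.80, -0.35, 0.44]
same_person = [0.10, 0.78, -0.33, 0.47]
stranger = [-0.60, 0.10, 0.70, 0.05]

print("same person:", matches(same_person, enrolled))
print("stranger:", matches(stranger, enrolled))
```

Because only the enrolled template and the comparison live on the device, nothing about the face has to be transmitted, which is the privacy property the example above relies on.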

Why it matters

Edge AI is what makes AI applications viable in contexts with latency, bandwidth, privacy, or connectivity constraints, which covers many real-world deployment environments outside data centres. Understanding edge AI - the compression techniques, hardware constraints, and deployment frameworks - is essential for building AI-powered consumer devices, IoT systems, industrial applications, and any use case where data cannot or should not leave the device.
