latentbrief

Concept

Model Serving

The infrastructure layer that takes a trained ML model and makes it available to receive requests, run predictions, and return results at production scale - the bridge between a trained artifact and a live application.

Added May 18, 2026

Training a model is only half the job. Serving it - making it available to actually do work - is where most of the engineering complexity lives in production ML systems. Model serving is the discipline of deploying trained models as callable services that can handle real traffic with acceptable latency, reliability, and cost.

At its simplest, model serving means wrapping a model in an API endpoint: receive an input, run inference, return output. But production serving requires much more. The serving layer must handle concurrent requests efficiently, route traffic across multiple model replicas, manage GPU or CPU resources, and maintain consistent latency under variable load.
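The "receive an input, run inference, return output" loop can be sketched with the standard library alone. This is a minimal illustration, not a production server: `toy_model`, its weights, and the `{"features": [...]}` request shape are all hypothetical stand-ins for a real trained model and its API contract.

```python
import json


def toy_model(features):
    """Stand-in for a trained model: a fixed linear scorer (hypothetical weights)."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def handle_request(body: bytes) -> tuple[int, bytes]:
    """Parse a JSON request body, run inference, return (HTTP status, JSON body)."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing key, or non-numeric features all end up here.
        error = {"error": "expected a JSON object with a 'features' list of numbers"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"prediction": toy_model(features)}).encode()
```

A real serving layer would put this handler behind an HTTP server and add the concerns listed above (concurrency, replicas, resource management); the handler itself stays this small.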

Serving frameworks like TorchServe, TensorFlow Serving, Triton Inference Server, and Ray Serve handle the common infrastructure concerns: model loading, request batching, health checks, metrics collection, and horizontal scaling. They also manage model versioning, allowing multiple versions of a model to run simultaneously for gradual rollouts or A/B testing.
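One way to picture a gradual rollout is weighted routing between versions. The sketch below is an assumption about how such a split could be done, not the mechanism of any framework named above: `pick_version`, the version names, and the weights are hypothetical. It hashes a request or user ID so that the same ID always lands on the same version.

```python
import hashlib


def pick_version(request_id: str, weights: dict[str, float]) -> str:
    """Map an ID to a model version, in proportion to non-negative `weights`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    digest = hashlib.sha256(request_id.encode()).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return version
    return version  # fallback for floating-point rounding at the top of the range


# Hypothetical rollout: send roughly 10% of traffic to the new version.
rollout = {"v1": 0.9, "v2": 0.1}
chosen = pick_version("user-1234", rollout)
```

Hashing rather than drawing a random number keeps each user on one version for the length of the experiment, which makes A/B comparisons easier to interpret.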

A central tradeoff in model serving is latency versus throughput. Batching requests together increases GPU utilisation and throughput but adds latency. Real-time applications need low latency and may sacrifice throughput. Batch inference jobs can maximise throughput by processing large datasets without any latency requirement. Many production systems use a tiered approach: a fast path for latency-sensitive requests and an async batch path for large workloads.
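The tradeoff can be made concrete with a back-of-the-envelope model. The numbers below are invented for illustration, and the linear cost model (a fixed overhead per batch plus a per-item cost) is an assumption; real hardware behaves less neatly.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0, arrival_gap_ms=5.0):
    """Estimate latency and throughput for one batch under a linear cost model.

    All default values are made-up illustrative numbers, not measurements.
    """
    compute_ms = fixed_ms + per_item_ms * batch_size
    # The first request to arrive waits for the other batch_size - 1 requests.
    wait_ms = arrival_gap_ms * (batch_size - 1)
    return {
        "latency_ms": wait_ms + compute_ms,                  # worst case, first arrival
        "throughput_rps": batch_size / compute_ms * 1000.0,  # per second of compute
    }


for size in (1, 4, 16):
    print(size, batch_stats(size))
```

Under these assumptions, going from a batch of 1 to a batch of 16 raises worst-case latency from 22 ms to 127 ms while throughput per second of compute rises roughly sevenfold, which is the shape of the tradeoff described above.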

Serving large language models introduces its own challenges. LLM serving frameworks like vLLM, TGI (Text Generation Inference), and SGLang are built around the memory demands of autoregressive generation: each new token attends to all prior tokens, so the server keeps a key-value (KV) cache for every request, and that cache grows with sequence length. Techniques like PagedAttention (vLLM's core innovation) allocate the cache in fixed-size blocks rather than one contiguous region per request, allowing efficient GPU memory utilisation for variable-length generations and dramatically improving the number of concurrent users a single GPU instance can serve.
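The idea behind paged cache allocation can be sketched as a small bookkeeping class. This is a simplified illustration of block-based allocation in general, not vLLM's implementation: `PagedKVAllocator` and its methods are hypothetical names, and it tracks only block IDs, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                   # tokens per block
        self.free_blocks = list(range(num_blocks))     # IDs of unused blocks
        self.block_tables: dict[str, list[int]] = {}   # sequence -> its block IDs
        self.lengths: dict[str, int] = {}              # sequence -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of `seq_id`."""
        table = self.block_tables.get(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Every owned block is full (or none is owned yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds the blocks it has actually filled, plus at most one partly filled block, short and long generations can share a GPU without each reserving memory for the maximum possible length.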

Modern serving infrastructure also handles continuous batching (adding new requests to an in-flight batch as prior requests complete), speculative decoding (a fast draft model proposes several tokens, which the larger model then verifies), and prefix caching (reusing KV cache computations for shared prompt prefixes across requests).
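A small simulation can show why continuous batching helps. It is a sketch under strong simplifying assumptions: every decoding step produces exactly one token for each active request, each request is represented only by the positive number of tokens it still needs, and the function names are hypothetical.

```python
from collections import deque


def static_steps(lengths, max_batch):
    """Steps needed when each batch must fully finish before the next starts."""
    steps = 0
    for start in range(0, len(lengths), max_batch):
        # A static batch runs as long as its longest request.
        steps += max(lengths[start:start + max_batch])
    return steps


def continuous_steps(lengths, max_batch):
    """Steps needed when freed slots are refilled from the queue immediately."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decoding step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


requests = [2, 8, 2, 2]  # tokens to generate per request (illustrative)
print(static_steps(requests, max_batch=2))      # 10
print(continuous_steps(requests, max_batch=2))  # 8
```

In this example the short requests no longer wait behind the 8-token one: the slot they free is reused at once, so the same work finishes in 8 steps instead of 10.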

Analogy

A restaurant kitchen can produce excellent dishes, but without a dining room, waitstaff, menu, reservations system, and the ability to handle many tables simultaneously, the food never reaches customers. Model serving is everything that turns a trained model - the recipe and the cook - into a restaurant that can serve hundreds of customers an evening without any single one waiting an unreasonable time.

Real-world example

When you send a message to a chatbot powered by an LLM, the response you see was produced by a serving system that received your request via an API, scheduled it into a batch with other concurrent requests, allocated GPU memory for your specific context length, ran autoregressive generation token by token, streamed the output back as tokens were generated, and tracked latency and error metrics throughout. Everything up to the first generated token happens in the 1-3 seconds before the first word appears; generation and streaming then continue until the response is complete.

Why it matters

A model that cannot be reliably served at scale has no practical value. Understanding model serving helps engineers choose the right infrastructure, diagnose latency and cost issues, and reason about the full ML system rather than just the model training phase. It also explains why deploying a large model is not as simple as running it on a laptop - production serving requires careful engineering to achieve acceptable economics.
