ONNX

Open Neural Network Exchange - an open format for representing ML models as portable computation graphs that can be exported from any training framework and run on any compatible runtime, enabling framework-agnostic deployment.

Added May 18, 2026 · 3 min read

The ML ecosystem is fragmented across training frameworks: PyTorch, TensorFlow, JAX, Keras, Caffe, MXNet. Each has its own model representation, operator library, and execution engine. A model trained in PyTorch cannot be directly loaded by TensorFlow, and vice versa. This creates friction when deploying models: the serving infrastructure may prefer a different framework than the research team used for training, or a hardware vendor's optimised runtime may not support the training framework.

ONNX (Open Neural Network Exchange) addresses this by defining an intermediate representation for ML models: a computation graph built from standardised operators. A model trained in a framework that has an ONNX exporter can be written out as an ONNX file, provided its operations map onto ONNX operators. That file can then be loaded and executed by any ONNX-compatible runtime that implements those operators, without requiring the original training framework.
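The export-then-run flow can be sketched roughly as follows. This is a minimal illustration, assuming `torch`, `onnxruntime`, and `numpy` are installed; the file name `tiny_linear.onnx`, the tensor names `x` and `y`, and the helper names `missing_packages` and `export_and_compare` are made up for the example, and export defaults vary between PyTorch versions.

```python
import importlib.util

REQUIRED = ("torch", "onnxruntime", "numpy")


def missing_packages(names=REQUIRED):
    """Return the packages from `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def export_and_compare(path="tiny_linear.onnx"):
    """Export a small model to ONNX, run it with ONNX Runtime, compare outputs."""
    # Imported lazily so the module loads even where these are not installed.
    import numpy as np
    import onnxruntime as ort
    import torch

    model = torch.nn.Linear(4, 2).eval()
    example = torch.randn(1, 4)

    # Export the model using an example input and write the ONNX file.
    torch.onnx.export(
        model, (example,), path, input_names=["x"], output_names=["y"]
    )

    # Load and run the file with ONNX Runtime; PyTorch is not used for inference.
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    (onnx_out,) = session.run(None, {"x": example.numpy()})

    # Reference output from the original framework, for comparison only.
    with torch.no_grad():
        torch_out = model(example).numpy()
    return bool(np.allclose(onnx_out, torch_out, atol=1e-5))
```

If `missing_packages()` returns an empty list, calling `export_and_compare()` should return `True` when the two runtimes agree within the chosen tolerance.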

The ONNX model format represents a computation graph where nodes are operations (matrix multiply, convolution, activation functions, attention, etc.) and edges carry tensor values between them. ONNX maintains a standard operator specification (the ONNX opset) that defines the semantics of each supported operation. Both the export side (training frameworks generating ONNX) and the runtime side (inference engines consuming ONNX) must implement these operators consistently.
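To make the nodes-and-edges picture concrete, here is a toy model of the idea in plain Python. It is not the ONNX file format or API: the `Node` class, the `OPSET` table, and `run_graph` are hypothetical names, and tensors are reduced to flat lists of floats. The operator names `Mul`, `Add`, and `Relu` are borrowed from the ONNX operator set.

```python
from dataclasses import dataclass


@dataclass
class Node:
    op_type: str   # operator name, e.g. "Relu"
    inputs: list   # names of tensors consumed (incoming edges)
    outputs: list  # names of tensors produced (outgoing edges)


# Toy operator set: each entry fixes the semantics of one operator.
OPSET = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}


def run_graph(nodes, feeds, output_names):
    """Execute nodes in order; `nodes` must already be topologically sorted."""
    values = dict(feeds)  # tensor name -> value
    for node in nodes:
        op = OPSET.get(node.op_type)
        if op is None:
            # A runtime can only execute operators it implements.
            raise NotImplementedError(f"unsupported operator: {node.op_type}")
        values[node.outputs[0]] = op(*(values[name] for name in node.inputs))
    return [values[name] for name in output_names]


# y = Relu(x * w + b), written as three nodes joined by named tensors.
graph = [
    Node("Mul", ["x", "w"], ["xw"]),
    Node("Add", ["xw", "b"], ["z"]),
    Node("Relu", ["z"], ["y"]),
]
feeds = {"x": [1.0, -2.0], "w": [3.0, 3.0], "b": [0.5, 0.5]}
result = run_graph(graph, feeds, ["y"])  # [[3.5, 0.0]]
```

The `NotImplementedError` branch mirrors why exporters and runtimes have to agree on the operator set: a graph that uses an operator the runtime does not implement cannot be executed.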

Runtimes that consume ONNX include ONNX Runtime (Microsoft's cross-platform inference engine, supporting CPUs, GPUs, and specialised accelerators), TensorRT (NVIDIA's optimised runtime, which accepts ONNX as input), and OpenVINO (Intel's inference toolkit). Apple's Core ML does not load ONNX files directly; it is typically reached through ONNX Runtime's CoreML execution provider or by converting the model. Hardware vendors can implement the ONNX operator set once and support any model that exports to the operators they cover, rather than implementing support for every training framework individually.

ONNX Runtime (ORT) has become particularly important as a deployment target. It applies graph optimisations to ONNX models - operator fusion (combining adjacent operations into a single fused kernel), constant folding (precomputing values that do not depend on the input), and layout transformations - that can significantly improve inference speed without changing what the model computes, apart from small floating-point differences. ORT also provides execution providers that delegate to hardware-specific backends: CUDA for NVIDIA GPUs, TensorRT for optimised NVIDIA execution, DirectML for Windows GPU acceleration, CoreML for Apple devices.
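Constant folding, one of the optimisations named above, can be illustrated with a toy pass over a tiny graph. This is a sketch of the general technique under simplifying assumptions (scalar values, two operators, nodes given in topological order), not ONNX Runtime's implementation; `fold_constants` and the tensor names are hypothetical.

```python
def fold_constants(nodes, constants):
    """Precompute every node whose inputs are all known before inference.

    nodes: list of (op_type, input_names, output_name), topologically sorted.
    constants: dict of tensor name -> value known ahead of time (e.g. weights).
    Returns (remaining_nodes, updated_constants).
    """
    ops = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}
    known = dict(constants)
    remaining = []
    for op_type, inputs, output in nodes:
        if op_type in ops and all(name in known for name in inputs):
            # All inputs are constants: compute once now and drop the node.
            known[output] = ops[op_type](*(known[name] for name in inputs))
        else:
            # Depends on a runtime input: must stay in the graph.
            remaining.append((op_type, inputs, output))
    return remaining, known


nodes = [
    ("Mul", ["scale", "two"], "scale2"),    # constants only: foldable
    ("Add", ["scale2", "bias"], "offset"),  # foldable once scale2 is known
    ("Mul", ["x", "offset"], "y"),          # needs runtime input x: kept
]
folded_nodes, folded_consts = fold_constants(
    nodes, {"scale": 1.5, "two": 2.0, "bias": 0.5}
)
# folded_nodes == [("Mul", ["x", "offset"], "y")], folded_consts["offset"] == 3.5
```

Three nodes become one, and the remaining node reads a precomputed `offset`. In ONNX Runtime itself, the optimisation level is chosen through `SessionOptions.graph_optimization_level` rather than by calling passes by hand.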

Limitations: not all PyTorch/TensorFlow operations have ONNX equivalents (data-dependent control flow is particularly difficult to capture at export time), and some model features (like custom CUDA kernels) cannot be represented in standard ONNX without registering custom operators in the runtime. Dynamic shapes (such as variable batch sizes and sequence lengths) usually have to be declared at export time; otherwise the exported graph may be fixed to the shape of the example input. Despite these limitations, ONNX has become the standard interoperability format for deploying models beyond research environments.
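For the dynamic-shape case, one common approach with `torch.onnx.export()` is its `dynamic_axes` argument, which names the axes that may vary. The sketch below assumes a model that maps token IDs of shape (batch, sequence) to logits with the same leading axes; the helper names and the tensor names `input_ids` and `logits` are illustrative, and recent PyTorch releases may favour a different mechanism for declaring dynamic dimensions.

```python
def batch_and_sequence_axes(tensor_names):
    """Mark axis 0 (batch) and axis 1 (sequence) as variable for each tensor."""
    return {name: {0: "batch", 1: "sequence"} for name in tensor_names}


def export_with_dynamic_shapes(model, example_ids, path):
    """Export `model` so batch size and sequence length can vary at inference.

    Assumes `torch` is installed and that `model` accepts `example_ids`.
    """
    import torch  # imported lazily so the helper above works without PyTorch

    torch.onnx.export(
        model,
        (example_ids,),
        path,
        input_names=["input_ids"],
        output_names=["logits"],
        # Without this mapping, the exported graph may keep the example's shape.
        dynamic_axes=batch_and_sequence_axes(["input_ids", "logits"]),
    )


axes = batch_and_sequence_axes(["input_ids", "logits"])
```

The symbolic names `batch` and `sequence` are labels only; what matters is which axis indices are listed for each tensor.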

Analogy

PDF is a format that allows documents created in any word processor to be viewed in any PDF reader, preserving the content and layout without requiring the reader to have the original application. ONNX plays the same role for ML models: it is a universal representation that allows a model created in any training framework to be executed on any compatible runtime, separating the creation environment from the deployment environment.

Real-world example

A research team trains a document understanding model in PyTorch using the Hugging Face library. The production deployment target is an ARM-based edge device running Intel OpenVINO. Using `torch.onnx.export()`, they export the model to ONNX format, then load it in OpenVINO's ONNX frontend. OpenVINO applies device-specific optimisations and compiles the graph for ARM execution. The deployed model runs 3x faster than PyTorch inference on the same device, without the research team needing to implement an OpenVINO-native model.

Why it matters

ONNX is the lingua franca of ML deployment, enabling the ecosystem of training frameworks to coexist with a diverse ecosystem of hardware and serving runtimes. Without it, every training framework would need to build integrations with every deployment target, and every hardware vendor would need to support every training framework. ONNX's standardisation enables specialised inference optimisations to benefit any model, regardless of how it was trained.

Related concepts

Model Pruning

A model compression technique that removes unnecessary weights from a trained neural network - reducing model size and inference cost by identifying and deleting parameters that contribute minimally to the model's outputs.

Model Serving

The infrastructure layer that takes a trained ML model and makes it available to receive requests, run predictions, and return results at production scale - the bridge between a trained artifact and a live application.
