Concept

Over-Smoothing

A fundamental limitation of deep graph neural networks where, after many message passing layers, all node representations converge to similar vectors - making deeper GNNs paradoxically worse at distinguishing nodes than shallow ones.

Added May 18, 2026

In standard deep learning, adding more layers generally improves performance by enabling more complex feature hierarchies: deeper CNNs recognise more complex visual patterns, deeper Transformers capture longer-range dependencies. Graph neural networks behave differently: beyond a small number of layers (typically 2-4), adding more GNN layers often hurts performance. The culprit is over-smoothing.

In GCN-style models, each message passing layer averages a node's features with its neighbours' features. After one layer, each node's representation blends features from its 1-hop neighbourhood. After two layers, it blends features from up to 2 hops away. After L layers, it has integrated information from all nodes within L hops. For large L, the receptive field covers most of the graph, and all nodes' representations become dominated by the same global average - the graph becomes "over-smoothed" and node representations lose their discriminative power.
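The blending described above can be observed on a toy graph. The following is a minimal sketch, assuming plain mean aggregation with self-loops and no learned weights or nonlinearities; the 6-node path graph, the one-hot features, and the helper names mean_aggregate and feature_spread are illustrative assumptions, not part of any library.

```python
import numpy as np

def mean_aggregate(adj: np.ndarray, features: np.ndarray) -> np.ndarray:
    """One round of averaging each node's features with its neighbours' features."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops so a node keeps its own features
    deg = a_hat.sum(axis=1, keepdims=True)    # neighbourhood size, counting the node itself
    return (a_hat @ features) / deg           # row-normalised average

def feature_spread(features: np.ndarray) -> float:
    """Largest per-feature gap between any two nodes (0 means all nodes are identical)."""
    return float((features.max(axis=0) - features.min(axis=0)).max())

# Hypothetical path graph 0-1-2-3-4-5, chosen only for illustration.
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

x = np.eye(n)        # one-hot features: every node starts out perfectly distinguishable
spreads = []
for layer in range(1, 65):
    x = mean_aggregate(adj, x)
    spreads.append(feature_spread(x))

for layer in (1, 2, 4, 8, 16, 32, 64):
    print(f"after {layer:2d} layers: spread = {spreads[layer - 1]:.6f}")
```

The spread never increases from one layer to the next, because every new representation is a convex combination of the previous ones, and after 64 rounds the six nodes are nearly indistinguishable.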

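Mathematically, repeated mean aggregation, ignoring the learned weights and nonlinearities, amounts to repeated application of a graph diffusion (random-walk) operator. Diffusion on a connected graph has a well-known property: in the limit of infinitely many applications (and provided the walk is aperiodic, which self-loops guarantee), the signal on every node converges to the same value, a degree-weighted average of the initial features given by the random walk's stationary distribution. With symmetric normalisation, as in GCN, the limit differs between nodes only by a factor that depends on node degree. Either way, information about individual node identity is washed out. This is why over-smoothing occurs.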
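The limiting behaviour can be checked numerically. This is a sketch under stated assumptions: a small connected graph with self-loops, the row-normalised operator P = D^-1 (A + I), and no weights or nonlinearities. The graph, the feature matrix, and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical 5-node connected graph, chosen only for illustration.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
n = 5
adj = np.zeros((n, n))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

a_hat = adj + np.eye(n)           # self-loops make the random walk aperiodic
deg = a_hat.sum(axis=1)           # node degrees, counting the self-loop
p = a_hat / deg[:, None]          # row-normalised (mean aggregation) operator

rng = np.random.default_rng(0)
x0 = rng.normal(size=(n, 3))      # arbitrary initial node features

# 500 rounds of mean aggregation, applied as a single matrix power.
x_deep = np.linalg.matrix_power(p, 500) @ x0

# Stationary distribution of the random walk: proportional to node degree.
pi = deg / deg.sum()
predicted_row = pi @ x0           # degree-weighted average of the initial features

rows_identical = np.allclose(x_deep, x_deep[0])
matches_prediction = np.allclose(x_deep, predicted_row)
print(rows_identical, matches_prediction)
```

Every row of x_deep ends up equal to the same degree-weighted average, so nothing in the output depends on which node a row belongs to.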

Over-smoothing limits the effective depth of GNNs. On many node classification benchmarks, 2-layer GCNs outperform versions with 4 or more layers. This is a fundamental difference from CNNs and Transformers, where additional layers usually help. It also limits how far information can propagate across the graph in a single forward pass, which matters when a task requires integrating information from distant nodes.

Mitigations include: residual connections (adding each layer's input to its output, as in ResNet, so that a layer cannot completely overwrite the representation it received), initial residual connections (adding the original node features directly at every layer, maintaining access to the unsmoothed signal), dense connections (connecting each layer to all previous layers, like DenseNet), jumping knowledge networks (combining representations from all layers, for example by concatenation, for the final prediction), and DropEdge (randomly removing edges during training, which slows the smoothing effect).
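As a rough illustration of one of these mitigations, the sketch below compares plain mean aggregation with an initial residual connection on a toy graph. It assumes the update x <- (1 - alpha) * P x + alpha * x0 with no learned weights; the mixing coefficient alpha, the path graph, and the function names are assumptions made for this example rather than the formulation of any specific published model.

```python
import numpy as np

def row_normalised(adj: np.ndarray) -> np.ndarray:
    """Mean-aggregation operator with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    return a_hat / a_hat.sum(axis=1, keepdims=True)

def propagate(adj: np.ndarray, x0: np.ndarray, num_layers: int, alpha: float) -> np.ndarray:
    """Repeated aggregation; alpha > 0 mixes the original features back in at every layer."""
    p = row_normalised(adj)
    x = x0
    for _ in range(num_layers):
        x = (1.0 - alpha) * (p @ x) + alpha * x0   # initial residual connection
    return x

def feature_spread(x: np.ndarray) -> float:
    """Largest per-feature gap between any two nodes (0 means all nodes are identical)."""
    return float((x.max(axis=0) - x.min(axis=0)).max())

# Hypothetical path graph 0-1-2-3-4-5 with one-hot node features.
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
x0 = np.eye(n)

spread_plain = feature_spread(propagate(adj, x0, num_layers=64, alpha=0.0))
spread_residual = feature_spread(propagate(adj, x0, num_layers=64, alpha=0.2))
print(f"plain: {spread_plain:.6f}  initial residual: {spread_residual:.6f}")
```

With alpha set to 0 the representations collapse, while with alpha set to 0.2 the nodes stay clearly separated even after 64 layers, because each layer re-injects a fixed share of the original signal.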

Over-squashing is a related but distinct problem: because the number of nodes in a receptive field can grow exponentially with the number of layers, information from all of them must be compressed into fixed-size node vectors and pass through a few bottleneck edges, so long-range information gets compressed and distorted. While over-smoothing is about depth causing uniformity, over-squashing is about topological bottlenecks preventing distant information from reaching target nodes.
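The exponential growth behind over-squashing is easy to count on a tree. The sketch below, which uses a hypothetical complete binary tree and illustrative helper names, measures how many nodes fall within k hops of the root; in a message-passing GNN, information from all of them would have to fit into the root's single fixed-size vector after k layers.

```python
from collections import deque

def complete_binary_tree(depth: int) -> dict:
    """Adjacency lists for a complete binary tree; node 0 is the root (heap indexing)."""
    n = 2 ** (depth + 1) - 1
    neighbours = {i: [] for i in range(n)}
    for child in range(1, n):
        parent = (child - 1) // 2
        neighbours[parent].append(child)
        neighbours[child].append(parent)
    return neighbours

def receptive_field_sizes(neighbours: dict, source: int, max_hops: int) -> list:
    """Number of nodes within k hops of source, for k = 0..max_hops, via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue                      # do not expand beyond the hop limit
        for nb in neighbours[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return [sum(1 for d in dist.values() if d <= k) for k in range(max_hops + 1)]

tree = complete_binary_tree(depth=8)
sizes = receptive_field_sizes(tree, source=0, max_hops=8)
print(sizes)   # [1, 3, 7, 15, 31, 63, 127, 255, 511]
```

The receptive field roughly doubles with every extra hop, while the vector that must summarise it stays the same size.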

Analogy

Mixing paint colours repeatedly. Start with dots of red, blue, yellow, and green paint (distinct node representations). Blend each dot with its neighbours once - you get a variety of mixed colours (first GNN layer). Blend again - the variety decreases as mixed colours merge further. After many blending operations, you end up with a single greyish-brown hue everywhere (over-smoothed, indistinguishable representations). Residual connections are like mixing in a fixed amount of the original colour at each step, preserving some of the original distinctiveness despite the blending.

Real-world example

A 2-layer GCN achieves 81.5% accuracy on the Cora citation network node classification benchmark. A 4-layer GCN drops to 79.2%, and an 8-layer GCN to 70.1%. Adding dropout and residual connections recovers the lost accuracy at moderate depth (4-layer with residuals: 82.1%), but over-smoothing means GCNs rarely benefit from going deeper than 4 layers, in contrast to ResNets, which can benefit from hundreds of layers. Exact figures vary with the training setup and data split.

Why it matters

Over-smoothing explains why GNN architectures do not benefit from depth the way CNNs and Transformers do - a surprising and important property that shapes how they are designed and tuned. Understanding it also clarifies why certain graph problems (those requiring information to be integrated across many hops) are fundamentally hard for standard message-passing GNNs, which motivates alternative approaches such as graph transformers.
