latentbrief
← Back to concepts

Concept

Graph Attention Network

A GNN variant that learns to assign different importance weights to different neighbours during aggregation - letting the model focus on the most relevant connections rather than treating all neighbours equally.

Added May 18, 2026

Graph Convolutional Networks aggregate information from all neighbours uniformly, assigning equal weight to every adjacent node regardless of how relevant or informative that neighbour is. In many real graphs, this is a poor assumption: in a citation network, a paper's most influential citations may be far more informative than incidental references; in a social network, close friends' opinions matter more than acquaintances'.

Graph Attention Networks (GAT), introduced by Velickovic et al. in 2018, address this by computing attention coefficients between every pair of adjacent nodes, determining how much each neighbour's features contribute to the current node's update. The attention mechanism is the same self-attention concept used in Transformers, adapted to the irregular neighbourhood structure of graphs.

The GAT layer computes, for each edge (i, j), an attention coefficient e_ij = LeakyReLU(a^T [W h_i || W h_j]), where h_i and h_j are the current feature vectors of nodes i and j, W is a shared linear transformation, a is a learned attention vector, and || denotes concatenation. These raw coefficients are normalised across all neighbours using softmax: alpha_ij = softmax_j(e_ij). The node's new representation is the weighted sum of transformed neighbour features: h'_i = sigma(sum_j alpha_ij W h_j).

Multi-head attention applies K independent attention mechanisms in parallel and concatenates (or averages) their outputs, exactly as in the Transformer architecture. Multiple attention heads allow the model to capture different types of relevance simultaneously - one head might focus on neighbours with similar feature vectors, another on neighbours in specific structural positions.

A key property of GAT: unlike GCN, the attention weights are functions of node features rather than fixed graph structure. This means a high-degree hub node does not automatically dominate (as it might in degree-normalised GCN) - if its features are irrelevant to the target node, attention weights will down-weight its contribution. GAT can also, in principle, handle inductive settings (new graphs at test time) if the attention function is feature-based and not tied to specific node identities.

GAT achieved state-of-the-art results on node classification benchmarks (Cora, Citeseer, PubMed) at the time of publication, and the architecture has inspired many variants: GATv2 (corrects an expressivity limitation in the original GAT attention), the Graph Transformer (applying full Transformer attention to graphs), and models that use edge features in attention computation.

Analogy

Reading a research paper and deciding how much to weight each cited reference when forming your own understanding of the topic. Rather than treating all citations as equally important, you pay more attention to the few papers that are most closely related and most widely cited for the specific claim you are evaluating. GAT applies this selective attention to graph aggregation: each node decides how much to attend to each of its neighbours based on the relevance of their features, rather than averaging all of them equally.

Real-world example

In a drug-protein interaction graph (nodes are drugs and proteins, edges connect drugs to their target proteins), a GAT predicting drug toxicity can learn to attend strongly to the protein targets that are most pharmacologically relevant to toxicity (e.g., liver enzymes) while down-weighting interactions with less relevant targets. The learned attention weights also serve as interpretable explanations: which protein interactions does the model consider most important for its toxicity prediction?

Why it matters

GAT demonstrated that attention mechanisms - the same idea powering Transformers in NLP - could be adapted to graph-structured data, enabling richer, more selective information aggregation. It remains a standard baseline for graph learning and introduced the idea of interpretable attention weights on edges, which has applications in scientific discovery where understanding which connections matter is as important as the prediction itself.

In the news

Related concepts