Concept

Sinusoidal Positional Encoding

The original method for telling transformer models where each token sits in a sequence - using mathematical sine and cosine waves to generate a unique position signal for every token.

Added May 18, 2026

The self-attention mechanism at the heart of transformers is position-agnostic by default: if you rearrange the tokens in a sequence, each token still receives exactly the same output, and the outputs are simply rearranged in the same way. This is a problem because word order is critical for meaning - "the dog bit the man" and "the man bit the dog" have identical words but opposite meanings. Without positional information, the model cannot distinguish them.
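The order-blindness described above can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and no masking; the function name `self_attention` and all sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = rng.permutation(seq_len)  # a random reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows in the same way.
order_is_invisible = np.allclose(out[perm], out_shuffled)
print(order_is_invisible)
```

Because each token's output vector is identical whichever slot it occupies, nothing downstream can recover the original order unless position is injected into the embeddings first.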

The original 2017 transformer paper solved this by adding a positional encoding vector to each token's embedding before processing. These vectors are constructed using sine and cosine functions at different frequencies: dimensions come in sine/cosine pairs, the first pairs use the highest frequencies (rapidly oscillating sinusoids), and later pairs use progressively lower frequencies (slowly changing sinusoids), with wavelengths growing geometrically from 2π to 10000·2π. Each position gets a unique pattern of values across all these frequencies.
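As a rough sketch of the construction, the paper's formula sets even dimensions to sin(pos / 10000^(2i/d_model)) and odd dimensions to the matching cosine, where i indexes the sine/cosine pair. The NumPy version below follows that formula; the function name `sinusoidal_encoding`, the `base` parameter name, and the sizes used are assumptions made for illustration.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    positions = np.arange(num_positions)[:, None]    # shape (num_positions, 1)
    pair_index = np.arange(d_model // 2)[None, :]    # shape (1, d_model / 2)
    # One angular frequency per sine/cosine pair, shrinking as pair_index grows.
    angular_freq = 1.0 / base ** (2 * pair_index / d_model)
    angles = positions * angular_freq                # shape (num_positions, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_encoding(num_positions=50, d_model=16)
# In a model, row p would be added to the embedding of the token at position p.
```

Position 0 comes out as alternating zeros and ones (sin 0 and cos 0), and every later row differs from it in at least the fastest-moving pair.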

The choice of sinusoids was deliberate. Each position has a unique encoding (no two positions produce the same pattern), and the encoding has a mathematical property the authors valued: for any fixed offset k, the encoding at position p+k can be expressed as a linear function of the encoding at position p. In theory, this allows the model to easily learn to attend to tokens a fixed number of positions away, regardless of absolute position.
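The linear relationship follows from the angle-addition identities: within each sine/cosine pair, moving k positions forward is a rotation by k times that pair's frequency. The sketch below builds that matrix and checks it at a few positions; `encode_position` and `offset_matrix` are hypothetical helper names, and the values of `d_model`, `k`, and the tested positions are arbitrary.

```python
import numpy as np

def encode_position(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (d_model assumed even)."""
    freqs = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def offset_matrix(k, d_model, base=10000.0):
    """Block-diagonal matrix M_k with M_k @ PE(p) == PE(p + k) for every p."""
    freqs = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(w(p+k)) =  sin(wp)cos(wk) + cos(wp)sin(wk)
        # cos(w(p+k)) = -sin(wp)sin(wk) + cos(wp)cos(wk)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

d_model, k = 16, 7
m_k = offset_matrix(k, d_model)  # depends on the offset k only, never on p
matches = [
    np.allclose(m_k @ encode_position(p, d_model), encode_position(p + k, d_model))
    for p in (0, 3, 250)
]
print(all(matches))
```

The matrix is built without reference to p, which is the point: one fixed linear map per offset works at every absolute position.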

Sinusoidal encodings are fixed - computed from a formula, not learned during training. This means they can be extended beyond the training sequence length without any additional learning, at least in principle. In practice, the model's other components are tuned to the training distribution of positions, so extreme extrapolation still degrades quality.
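To make the contrast with a learned table concrete, here is a small sketch in which a random array stands in for a trained embedding table; `train_len`, `unseen_pos`, and `encode_position` are illustrative assumptions. It shows only that the formula yields a well-formed vector at an unseen position, not that a model would handle it well.

```python
import numpy as np

def encode_position(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (d_model assumed even)."""
    freqs = 1.0 / base ** (2 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

train_len, d_model = 128, 16  # hypothetical training sequence length and width
rng = np.random.default_rng(0)
learned_table = rng.normal(size=(train_len, d_model))  # stand-in for learned embeddings

unseen_pos = 1000  # far beyond train_len
formula_vector = encode_position(unseen_pos, d_model)    # always defined
table_has_row = unseen_pos < learned_table.shape[0]      # no row exists for this position
print(formula_vector.shape, table_has_row)
```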

Sinusoidal encoding has been largely superseded by learned position encodings (in some architectures) and by RoPE (in most modern models), which rotates query and key vectors so that attention scores depend on relative rather than absolute position. But understanding sinusoidal encoding is valuable because it was the original solution and its properties directly motivated the design of its successors.

Analogy

Think of the way musicians tune an orchestra using a reference pitch. Every instrument adjusts to the same 440 Hz A, creating a shared anchor. Sinusoidal position encoding is similar - it provides every token with a mathematically consistent "where am I in the sequence" signal by generating patterns that are unique to each position, using the universal language of sine waves.

Real-world example

The original Transformer paper ("Attention Is All You Need") used sinusoidal encodings and explained why: the authors hypothesised that they would let the model easily learn to attend by relative position, and that they might allow it to extrapolate to sequence lengths longer than those seen in training. The paper also reported that learned positional embeddings produced nearly identical results. While subsequent work has moved beyond these specific encodings, the paper's analysis of what position encodings need to do remains the foundation for understanding all subsequent designs.

Why it matters

Sinusoidal position encoding's legacy is in setting the template for thinking about positional information in transformers. The questions it raised - how to represent absolute versus relative position, how to generalise to unseen sequence lengths, how to encode position efficiently - are still the questions that RoPE, ALiBi, and other modern position encodings are answering.

Related concepts

Rotary Position Embedding (RoPE), Self-Attention, Transformer