The Problem

Since Vaswani et al., 2017 [16], many schemes have been introduced for encoding positional information in transformers. When applying self-attention to a given domain, the choice of position encoding typically involves tradeoffs between simplicity, flexibility, and efficiency. For example, learned absolute positional encodings are very simple, but they may not generalize and are not always particularly meaningful, due to the common practices of packing short sentences and phrases together in a single context and of breaking up sentences across contexts.

Another major limitation of existing methods is that they do not work with efficient transformers. Methods like T5's relative positional bias require constructing the full $N \times N$ attention matrix between positions, which is not possible when using many of the efficient alternatives to softmax attention. A principled, easy-to-implement, and generally applicable method for relative position encoding, one that works for both vanilla and "efficient" attention, is therefore of great interest.

For each token, we know where it is in the sequence. However, dot products (and therefore attention) do not preserve absolute positional information, so if we encode position into the absolute values of the embeddings, as the original sinusoidal positional embeddings do, we will lose a significant amount of that information. On the other hand, dot products do preserve relative position, so if we can encode the absolute positional information into the token embeddings in a way that only leverages relative positional information, that information will be preserved by the attention function. Rotary Positional Embedding (RoPE) is designed to address this need.
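To see why this matters, here is a minimal PyTorch check (a sketch, not from any paper): rotating two vectors by the same angle changes their absolute orientation, but their dot product, which depends only on their relative geometry, is unchanged.

import math
import torch

def rotate(v, angle):
    # rotate a 2D vector by `angle` radians
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]]) @ v

q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])

# both vectors rotated by the same angle: absolute orientation changes,
# but the dot product (and hence the attention score) does not
print(q @ k)
print(rotate(q, 0.3) @ rotate(k, 0.3))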

Rotary Positional Embeddings (RoPE)

RoPE is one of the most widely used positional encoding techniques in modern NLP. It encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation.

Rotary encoding transforms pairs of features by rotating them in the 2D plane. That is, it organizes the $d$ features as $d/2$ pairs. Each pair can be considered a coordinate in a 2D plane, and the encoding rotates it by an angle that depends on the position of the token. For example, let $x_m^{(1)}$ and $x_m^{(2)}$ be one such pair of features of the key or query of any attention head at position $m$, where $m$ is the index of the token in the sequence. The transformation is as follows:

$$\mathrm{RoPE}(x_m^{(1)}, x_m^{(2)}, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} = \begin{pmatrix} x_m^{(1)}\cos m\theta - x_m^{(2)}\sin m\theta \\ x_m^{(1)}\sin m\theta + x_m^{(2)}\cos m\theta \end{pmatrix}$$

where $\theta$ is a constant angle and the position index $m$ injects the absolute position information. For 4 features, the rotation matrix is as follows:

$$R^d_{\theta, m} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 \end{pmatrix}$$

resulting in the rotation:

$$\begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \\ x_m^{(3)} \\ x_m^{(4)} \end{pmatrix}$$
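As a concrete (and deliberately naive) sketch of this transform, the hypothetical helper rope4 below builds the 4-feature rotation matrix directly and applies it to a vector; theta1 and theta2 are placeholder angles here, since the paper's actual schedule for the $\theta_i$ is given later.

import math
import torch

def rope4(x, m, theta1=1.0, theta2=0.01):
    # x: a 4-feature vector treated as two pairs; m: integer token position
    c1, s1 = math.cos(m * theta1), math.sin(m * theta1)
    c2, s2 = math.cos(m * theta2), math.sin(m * theta2)
    R = torch.tensor([[c1, -s1, 0.0, 0.0],
                      [s1,  c1, 0.0, 0.0],
                      [0.0, 0.0, c2, -s2],
                      [0.0, 0.0, s2,  c2]])
    return R @ x

x = torch.randn(4)
print(rope4(x, m=3))  # the same features, rotated pair by pair for position 3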

The attention score between two such feature pairs is also relative. This is because the dot product between the encodings at positions $m$ and $n$ can be multiplied out and simplified to:

$$\langle \mathrm{RoPE}(x_m^{(1)}, x_m^{(2)}, m),\ \mathrm{RoPE}(x_n^{(1)}, x_n^{(2)}, n) \rangle = \langle \mathrm{RoPE}(x_m^{(1)}, x_m^{(2)}, m-n),\ \mathrm{RoPE}(x_n^{(1)}, x_n^{(2)}, 0) \rangle$$

The intuition behind RoPE is that we can represent the token embeddings as complex numbers and their positions as pure rotations that we apply to them. If we shift both the query and key by the same amount, changing absolute position but not relative position, both representations are rotated in the same manner, so the angle between them, and therefore the dot product, remains unchanged. Incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index. By exploiting the nature of rotations, the dot product used in self-attention gains the property we are looking for: it preserves relative positional information while discarding absolute position. In other words, for dot-product attention, the rotary encoding gives relative attention.
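The equality above is easy to check numerically on a single feature pair; rope2 is a hypothetical helper implementing the 2D rotation defined earlier.

import math
import torch

def rope2(x, m, theta=0.5):
    # rotate one feature pair by m * theta (theta is an illustrative constant)
    c, s = math.cos(m * theta), math.sin(m * theta)
    return torch.tensor([[c, -s], [s, c]]) @ x

q, k = torch.randn(2), torch.randn(2)
m, n = 7, 3
lhs = rope2(q, m) @ rope2(k, n)
rhs = rope2(q, m - n) @ rope2(k, 0)
print(torch.isclose(lhs, rhs))  # True: the score depends only on m - n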

In Practice

Rotary position embeddings introduce no learned parameters: the rotation angles are a fixed, deterministic function of position. The rotation is applied to both queries and keys before the dot-product score is computed in the attention mechanism. For query $q_m$ at position $m$ and key $k_n$ at position $n$, the attention score is:

$$\mathrm{score}(q_m, k_n) = (R^d_{\theta, m} q_m)^T (R^d_{\theta, n} k_n) = q_m^T R^d_{\theta, n-m} k_n$$

where $R^d_{\theta, n-m} = (R^d_{\theta, m})^T R^d_{\theta, n}$ and $R^d_{\theta, m}$ is the rotation matrix defined above.
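This composition identity can be confirmed numerically for a single 2x2 block; the full block-diagonal matrix behaves the same way, pair by pair. The rot helper below exists only for this check.

import math
import torch

def rot(m, theta=0.5):
    # 2x2 rotation by m * theta (an illustrative constant angle)
    c, s = math.cos(m * theta), math.sin(m * theta)
    return torch.tensor([[c, -s], [s, c]])

m, n = 9, 4
print(torch.allclose(rot(m).T @ rot(n), rot(n - m)))  # True: R_m^T R_n = R_{n-m}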

In each transformer layer, the rotation is applied to the queries and keys of every attention head (computed from the token embeddings or from the output of the previous layer), injecting information about the position of tokens. The rotated queries and keys are then processed through the self-attention mechanism and feed-forward networks as in the standard transformer architecture. Because the rotations are fixed functions of position, no additional parameters are updated during training.
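To make the data flow concrete, here is a minimal single-head sketch of where the rotation sits relative to the rest of attention; attention_with_rope is a hypothetical function, and R stands for per-position rotation matrices like the ones constructed in the implementation below.

import torch

def attention_with_rope(x, Wq, Wk, Wv, R):
    # x: (B, T, C) inputs; R: (T, C, C) per-position rotation matrices
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # rotate queries and keys only; values are left untouched
    q = torch.einsum('tij,btj->bti', R, q)
    k = torch.einsum('tij,btj->bti', R, k)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

B, T, C = 2, 5, 8  # toy sizes for this sketch only
x = torch.randn(B, T, C)
Wq, Wk, Wv = torch.randn(C, C), torch.randn(C, C), torch.randn(C, C)
R = torch.eye(C).repeat(T, 1, 1)  # identity "rotations", just to exercise the shapes
print(attention_with_rope(x, Wq, Wk, Wv, R).shape)  # torch.Size([2, 5, 8])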

Implementation

Here is a simple PyTorch implementation, one that is much less efficient than the implementation suggested in the paper:

import torch
from torch import nn

B, T, C = 4, 16, 32 #B = Batch size, T = sequence length, C = embedding size

embedding = torch.randn((B, T, C))

q_weight = torch.randn((C, C))
k_weight = torch.randn((C, C))


# inputs to each transformer block prior to the attention mechanism:
qx = embedding @ q_weight 
kx = embedding @ k_weight 

nh = 4 # number of attention heads

# split each projection into nh heads with hs = C // nh features per head
qx = qx.reshape(B, T, nh, C // nh)
kx = kx.reshape(B, T, nh, C // nh)
qx.shape, kx.shape

So the shapes of qx and kx are both torch.Size([4, 16, 4, 8]): 4 for the batch size, 16 for the sequence length, 4 for the number of heads, and 8 for the number of features per head (embedding size divided by number of heads). The rotary embeddings work on feature pairs within each head, so when applying RoPE to qx and kx, we break each head down further into pairs of 2 features. Since each head has 8 features, each head becomes four pairs of two features.

hs = C // nh
qx_re = qx.reshape(B, T, nh, hs//2, 2)
kx_re = kx.reshape(B, T, nh, hs//2, 2)

So the shape is now torch.Size([4, 16, 4, 4, 2]). This pairwise view is only to make the grouping explicit; the implementation below applies an equivalent block-diagonal rotation matrix to the full feature vector rather than operating on qx_re and kx_re directly.

Let's now compute the angles $\theta_i$ and then the rotation matrix $R$. The paper gives us the equation for computing the angles:

$$\theta_i = 10000^{\frac{-2(i-1)}{d}}, \quad i \in [1, \ldots, d/2]$$

where $d$ is the number of features being rotated together, which here is the head size hs.

pairs = torch.arange(0, C // 2)  # index of each feature pair across the full embedding
idx_skipper = pairs * 2          # even indices: the first element of each pair

# one theta per feature pair within a head; every head uses the same frequencies,
# so tile them across all nh heads to cover the full embedding dimension
theta = 10000. ** ((-2.0 * torch.arange(0, hs // 2)) / hs)
theta = theta.repeat(nh)  # shape (C // 2,)

positions = torch.arange(1, T + 1).unsqueeze(1)  # aka m in the paper, shape (T, 1)
m_theta = positions * theta  # shape (T, C // 2)
cos_values = torch.cos(m_theta)
sin_values = torch.sin(m_theta)

R = torch.zeros((T, C, C), requires_grad=False)  # one block-diagonal rotation matrix per position

# Now to slice in our values, creating the sparse rotation matrix.
R[:, idx_skipper, idx_skipper] = cos_values
R[:, idx_skipper, idx_skipper + 1] = -sin_values
R[:, idx_skipper + 1, idx_skipper] = sin_values
R[:, idx_skipper + 1, idx_skipper + 1] = cos_values
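As a quick sanity check, each slice R[t] should be a pure rotation (and therefore orthogonal), and composing two slices should give the rotation for their relative offset. Note that R[i] encodes position i + 1, since positions starts at 1 above.

# every R[t] is orthogonal: R[t] @ R[t].T is (approximately) the identity
print(torch.allclose(R @ R.transpose(1, 2), torch.eye(C).expand(T, C, C), atol=1e-5))

# R[2] encodes position 3 and R[6] position 7, so R[2].T @ R[6] rotates by
# 4 positions, which is exactly R[3] (position 4)
print(torch.allclose(R[2].T @ R[6], R[3], atol=1e-5))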

Rotating the entire query and key matrix:

qx = embedding @ q_weight  # (B, T, C)
kx = embedding @ k_weight

# Move the position dimension to the front so it lines up with R's batch dimension.
# For row vectors, x @ R[t].T applies the rotation R[t], matching (R_m q_m) above.
qx_rotated = (qx.transpose(0, 1) @ R.transpose(1, 2)).transpose(0, 1)
kx_rotated = (kx.transpose(0, 1) @ R.transpose(1, 2)).transpose(0, 1)
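From here the rotated projections feed into the usual multi-head attention computation; as a usage sketch (the downstream step, not part of the rotation itself), the per-head attention weights would be computed like this:

# split the rotated projections back into heads and compute attention weights
qh = qx_rotated.reshape(B, T, nh, hs).transpose(1, 2)  # (B, nh, T, hs)
kh = kx_rotated.reshape(B, T, nh, hs).transpose(1, 2)  # (B, nh, T, hs)
scores = (qh @ kh.transpose(-2, -1)) / (hs ** 0.5)     # (B, nh, T, T)
weights = torch.softmax(scores, dim=-1)
print(weights.shape)  # torch.Size([4, 4, 16, 16])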