This work summarizes the current techniques used to interpret the inner workings of transformer language models. As such, summarizing a summary is a challenge, but we will give it a try nonetheless.
The paper starts by introducing the transformer architecture and its notation. The authors adopt the residual stream perspective on interpretability: each input embedding is updated via vector additions from the attention and feed-forward blocks, producing a sequence of residual stream states. The final-layer residual stream state is then projected into vocabulary space via the unembedding matrix and normalized via softmax to obtain a distribution over the vocabulary, from which the output token is sampled.
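As a rough sketch of this view (the names `attn_block`, `mlp_block`, `W_U`, and `ln_final` are placeholders of mine, not the paper's notation), the whole forward pass can be read as a running sum in the residual stream:

```python
import torch

def forward_residual_view(x, blocks, W_U, ln_final):
    """Residual-stream view of a decoder-only transformer.

    x:        (seq, d_model) input embeddings (token + positional)
    blocks:   list of (attn_block, mlp_block) callables, each returning an
              additive update of shape (seq, d_model)
    W_U:      (d_model, vocab) unembedding matrix
    ln_final: final layer norm
    """
    resid = x  # initial residual stream states
    for attn_block, mlp_block in blocks:
        resid = resid + attn_block(resid)  # attention block writes into the stream
        resid = resid + mlp_block(resid)   # feed-forward block writes into the stream
    logits = ln_final(resid) @ W_U         # project final states into vocabulary space
    return torch.softmax(logits, dim=-1)   # distribution over the vocabulary
```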
Layer Norm
A common operation used to stabilize training, nowadays typically applied before the attention block due to empirical success. Given a representation $z$, layer norm computes $\frac{z - \mu(z)}{\sigma(z)} \odot \gamma + \beta$, where $\mu$ and $\sigma$ compute the mean and standard deviation, and $\gamma \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are a learned element-wise scale and bias, respectively.
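A minimal implementation of this formula (a sketch; the `eps` term inside the square root is the usual numerical-stability detail, not something the summary above mentions):

```python
import torch

def layer_norm(z, gamma, beta, eps=1e-5):
    """Layer norm over the feature dimension.

    z:     (..., d) representation
    gamma: (d,) learned element-wise scale
    beta:  (d,) learned bias
    """
    mu = z.mean(dim=-1, keepdim=True)                   # mean over features
    var = z.var(dim=-1, keepdim=True, unbiased=False)   # (biased) variance over features
    return (z - mu) / torch.sqrt(var + eps) * gamma + beta
```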
The authors discuss how layer norm can be interpreted geometrically: the mean subtraction is a projection onto the hyperplane whose normal vector is $[1, 1, \dots, 1] \in \mathbb{R}^d$, and the subsequent scaling maps the resulting representation onto a hypersphere (look into this further).
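One way to convince yourself of this picture is a quick numerical check (a sketch, assuming the layer norm formula above): subtracting the mean is the same as removing the component of $z$ along the all-ones direction, and dividing by the standard deviation lands the result on a hypersphere of radius $\sqrt{d}$.

```python
import torch

d = 8
z = torch.randn(d)

# Mean subtraction...
centered = z - z.mean()

# ...equals projecting z onto the hyperplane with normal [1, 1, ..., 1],
# i.e. removing z's component along the (normalized) all-ones vector.
n = torch.ones(d) / torch.ones(d).norm()
projected = z - (z @ n) * n
print(torch.allclose(centered, projected, atol=1e-6))  # True

# Dividing by the (biased) std then maps the centered vector onto a
# hypersphere of radius sqrt(d), before the element-wise scale by gamma.
normalized = centered / centered.std(unbiased=False)
print(normalized.norm(), d ** 0.5)  # both ~2.83 for d = 8
```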
Attention Block
The attention block is composed of multiple attention heads. At a decoding step $i$, each attention head reads from the residual streams of previous positions, decides which positions to attend to, gathers information from them, and finally writes it into the current residual stream.
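A minimal sketch of a single head's read/attend/gather/write cycle (the weight names `W_Q`, `W_K`, `W_V`, `W_O` are generic, not the paper's exact notation):

```python
import torch

def attention_head(resid, W_Q, W_K, W_V, W_O):
    """One attention head over residual stream states resid: (seq, d_model).

    Returns the additive update each position writes back into its stream.
    """
    seq, d_model = resid.shape
    d_head = W_Q.shape[1]

    q = resid @ W_Q  # read: queries from the current residual streams
    k = resid @ W_K  # read: keys from the residual streams at each position
    v = resid @ W_V  # gather: values carry the information to be moved

    scores = (q @ k.T) / d_head ** 0.5
    # causal mask: position i may only attend to positions <= i
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)  # decide which positions to attend to

    gathered = attn @ v                   # weighted mix of value vectors
    return gathered @ W_O                 # write back into the residual stream
```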
Using tensor rearrangement operations, one can rewrite each head's output as a sum of per-position contributions, which simplifies the analysis of what each position writes into the residual stream (see the sketch below).
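For instance (a sketch reusing the hypothetical head above), the write at destination position $i$ decomposes into one term per source position $j$, namely the attention weight times that position's value vector pushed through the output projection:

```python
import torch

def per_position_contributions(attn, v, W_O):
    """Decompose a head's output into per-source-position contributions.

    attn: (seq, seq) attention weights
    v:    (seq, d_head) value vectors
    W_O:  (d_head, d_model) output projection
    Returns a (seq, seq, d_model) tensor whose [i, j] slice is
    attn[i, j] * (v[j] @ W_O), the write from source j into destination i.
    """
    return torch.einsum("ij,jd,de->ije", attn, v, W_O)

# Summing over the source axis recovers the head's usual output:
# per_position_contributions(attn, v, W_O).sum(dim=1) == (attn @ v) @ W_O
```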