============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

More Attention to Attention

attention-encodings

[Note: the full demo tool is available for fiddling with here.]

A few months ago, I built a demo to visualize how token representations evolve inside the RoBERTa transformer model. Like a lot of people, I had the vague impression that transformers gradually contextualize each token layer by layer, refining its meaning as it flows through the network. But the more I looked, the less that held up. Depending on how I measured representational change (cosine similarity, vocabulary ranking, nearest neighbor overlap), I got completely different stories. This post is a short walkthrough of what I found, and how it challenged my assumptions about how and when transformers “make meaning.”

I’m using the phrase “Time flies like an arrow. Fruit flies like a banana.”, since it’s short and has some interesting disambiguation issues around “flies” and “time”.

Cosine Similarity - Early Drift, then Stabilization

First, let’s look at cosine similarities to the original encoding. In each attention layer, the outputs of the individual heads are concatenated and projected into a residual encoding, which is then added to the previous encoding at that position in the sequence. This plot shows the cosine similarity of that attention+residual output at each layer to the original token encoding for the first token, “Time”.

[cosine similarity graph]

All of the metrics in this post, including this one, are based on the attention+residual output, after positional encodings have been added at layer 0. You can see a quick move away from the original encoding, then a slight walkback in the later layers: fast movement in the first half, smaller motion in a different direction in the second half. Note that the pattern is roughly the same for the other tokens as well.
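
To make the measurement concrete, here is a minimal sketch of how you might reproduce this metric with Hugging Face’s transformers library. It isn’t the demo’s exact code: it assumes the token index for “Time” is 1 (position 0 is the <s> token), and it captures each layer’s attention output sublayer with a forward hook, which in the standard RoBERTa implementation applies the head projection, the residual add, and LayerNorm, so it should roughly match the attention+residual output described above.

    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("roberta-base")
    model.eval()

    # Capture each layer's attention+residual output with forward hooks.
    # (Assumes the standard Hugging Face RoBERTa module layout.)
    attention_outputs = []
    hooks = [
        layer.attention.output.register_forward_hook(
            lambda mod, inp, out: attention_outputs.append(out.detach())
        )
        for layer in model.encoder.layer
    ]

    text = "Time flies like an arrow. Fruit flies like a banana."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    for h in hooks:
        h.remove()

    token_index = 1  # assumed position of "Time"; index 0 is <s>
    # hidden_states[0] is the layer-0 encoding: token + positional embeddings.
    original = outputs.hidden_states[0][0, token_index]

    for layer, attn_out in enumerate(attention_outputs, start=1):
        sim = F.cosine_similarity(attn_out[0, token_index], original, dim=0)
        print(f"layer {layer:2d}: cosine similarity to original = {sim:.3f}")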

Relative Cosine Similarities - Stable Until a Final Jump

Second, let’s look at a rankings comparison. This metric measures how close the token encoding output at each layer is to its original encoding, relative to the other original vocabulary encodings. For example, a ranking of “0” means that for that token, at that layer, no other vocabulary encoding is closer to the original token encoding than that layer’s output. A ranking of “12,000” would mean that 12,000 vocabulary encodings are closer to the original encoding than that layer’s output is. Again, let’s look at “Time”:

[ranking graph]

In this case, it looks like the token stays pretty close to its original value, with minor exceptions, for the first 11 layers, then in the final layer moves far away from its original value, relative to all of the other vocabulary embeddings. It almost looks like the model is “cramming” some last-minute semantic reorganization before the final output.

(Also note that there are some large temporary shifts at layer 5 for other tokens. Those are mostly function words, and I may touch on those in a later post.)
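
Here is a rough sketch of this ranking, under the assumption that “original vocabulary encodings” means RoBERTa’s static input embedding matrix. For each layer it counts how many vocabulary embeddings are closer (by cosine similarity) to the token’s original embedding than that layer’s output is. It reuses model, inputs, token_index, and attention_outputs from the previous sketch.

    import torch
    import torch.nn.functional as F

    # Assumed reference set: the static input embedding matrix.
    vocab_embeddings = model.embeddings.word_embeddings.weight  # (vocab_size, hidden)
    token_id = inputs["input_ids"][0, token_index]
    original_embedding = vocab_embeddings[token_id]

    # Cosine similarity of every vocabulary embedding to the original embedding.
    vocab_sims = F.cosine_similarity(
        vocab_embeddings, original_embedding.unsqueeze(0), dim=1
    )

    for layer, attn_out in enumerate(attention_outputs, start=1):
        output_sim = F.cosine_similarity(
            attn_out[0, token_index], original_embedding, dim=0
        )
        # Ranking = number of vocabulary embeddings more similar to the original
        # than this layer's output, minus 1 to exclude the token's own embedding.
        ranking = (vocab_sims > output_sim).sum().item() - 1
        print(f"layer {layer:2d}: ranking = {ranking}")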

Closest 100 Encodings Churn Between Layers - Speed Up, Max Out, Slow Down

For the last (and strangest) metric, we can look at the relative rankings from the previous metric in a different way. For the specified token, take the 100 closest original vocabulary embeddings at each layer, and see how many of those remain in the closest 100 at the next layer. This is another way to track the motion of the token encodings between layers: how quickly the meaning is changing relative to the original token vocabulary. Here’s “Time” again:

[overlap graph]

This is interesting. Moving up to layer 5, the list of 100 closest encodings changes faster and faster, to the point where for a couple of layers the token seems to be “moving” so fast that it keeps none of the same 100 closest terms between layers. Then the rate of change gradually slows, until by the final layers almost half of the nearest neighbors carry over from one layer to the next. The token seems to change rapidly through the first half of the model and much more slowly through the second half. Viewed from this angle, the model isn’t cramming at all.
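
Here is a sketch of this churn metric, again reusing model, token_index, and attention_outputs from the earlier sketches, and again assuming the static input embedding matrix as the vocabulary reference:

    import torch
    import torch.nn.functional as F

    vocab_norm = F.normalize(model.embeddings.word_embeddings.weight, dim=1)

    def closest_vocab_ids(vector, k=100):
        """IDs of the k vocabulary embeddings most similar (by cosine) to vector."""
        sims = vocab_norm @ F.normalize(vector, dim=0)
        return set(torch.topk(sims, k).indices.tolist())

    neighbor_sets = [
        closest_vocab_ids(attn_out[0, token_index]) for attn_out in attention_outputs
    ]

    # Count how many of the 100 nearest vocabulary neighbors survive from one
    # layer to the next.
    for layer in range(1, len(neighbor_sets)):
        overlap = len(neighbor_sets[layer - 1] & neighbor_sets[layer])
        print(f"layers {layer} -> {layer + 1}: {overlap}/100 neighbors retained")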

Conclusion?

Each of these metrics seems to shine a very different light on what the attention mechanisms at each layer are doing to the token encodings as they travel through the RoBERTa model. Is a token encoding’s journey a steady refinement, a last-minute jump at the end, or is it accelerating to the halfway point and then decelerating to its destination?

Note: I used “Time” as the example in all of these metrics, but the pattern holds for all of the tokens in the sequence, as you can see in the first two plots.

Want to explore? The full demo is here.

If you’ve looked at this kind of thing before, or found your own odd behaviors, I’d be curious to hear what you’ve seen.