Paying Attention Part 2
A greatly revised attention-encodings project is out and available here.
I realized that in the previous incarnation of this project I was comparing per-head attention outputs to value encodings over the entire model vocabulary, which seemed like a clever idea. The problem is that I forgot about the residual connection around the attention mechanism. So the project was essentially comparing the residual “deltas” against the original encodings, which might make some sense if you were just wondering how “far” an encoding moves at each layer.
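To make that concrete, here's a toy pre-LayerNorm attention sub-block in PyTorch. This is just a sketch of where the residual enters, not the project's code: the attention output alone is the “delta”, and what actually flows on to the next layer is the input plus that delta.

```python
import torch
import torch.nn as nn

d_model = 8
ln = nn.LayerNorm(d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

x = torch.randn(1, 5, d_model)            # (batch, seq, d_model) toy input

# Pre-LayerNorm attention sub-block, GPT-2 style.
normed = ln(x)
delta, _ = attn(normed, normed, normed)   # attention output alone: the residual "delta"
out = x + delta                           # what the rest of the model actually sees

# Comparing `delta` to the vocabulary embeddings measures how far an encoding
# moved at this layer; comparing `out` measures where the encoding ends up.
```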
But that doesn’t quite fit what I was trying to get at, which was how attention changes token encodings as they pass through all the layers of a transformer model. What do token encodings “look like” after 10 or 12 attention mechanisms have beaten them up for a while?
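If you want to look for yourself, the per-layer encodings are easy to pull out of a Hugging Face model. Here's a minimal sketch using GPT-2 small, which may not be exactly the model or setup the project uses:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# One tensor for the input embeddings plus one per layer:
# 13 tensors of shape (batch, seq_len, 768) for GPT-2 small.
for i, h in enumerate(out.hidden_states):
    print(i, h.shape)
```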
It turns out that it depends on the type of word (I’m using “word” and “token” interchangeably here, which isn’t great, but you’ll get the point). Content words, or words that carry meaning beyond their syntactic role, move around a lot relative to all of the other vocabulary words, while function words (“a”, “the”, “in”, etc.) don’t move much at all.
At least, the content words move much more if you compare them by ranking: how many vocabulary words their current encoding is closer to than it is to their original encoding. They don’t move that much more than the function words if you measure the actual cosine distances.
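Here's roughly what those two measurements look like in code. This is a sketch of the idea rather than the project's actual implementation, and the `movement_metrics` name is mine:

```python
import torch
import torch.nn.functional as F

def movement_metrics(current, original, vocab_emb):
    """Two ways to measure how far a token's encoding has moved.

    current:   (d,) the token's encoding after some layer
    original:  (d,) the token's original input embedding
    vocab_emb: (V, d) the embedding matrix for the whole vocabulary
    """
    sim_to_original = F.cosine_similarity(current, original, dim=0)

    # 1. Raw cosine distance from the current encoding to the original one.
    cos_dist = 1 - sim_to_original

    # 2. Rank-based movement: how many vocabulary embeddings are now closer
    #    (by cosine similarity) to the current encoding than its original is.
    sims = F.cosine_similarity(current.unsqueeze(0), vocab_emb, dim=1)  # (V,)
    rank = (sims > sim_to_original).sum().item()

    return cos_dist.item(), rank

# Toy usage with a fake vocabulary.
vocab = torch.randn(100, 16)
original = vocab[3]
current = original + 0.1 * torch.randn(16)
print(movement_metrics(current, original, vocab))
```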
But as far as relative meaning goes, the content words move around a lot, which implies they’re moving through a much denser region of the encoding space than the function words are.
Two other interesting things stand out, both of them leaps the model seems to take while processing the encodings:
- Layer 5 is especially weird. The function words move around a lot all at the same time, then in layer 6 they move back to their original areas in the space. The content words seem unaffected by all the commotion. What’s going on there?
- Layer 11, the final layer, shows large movements in the content words and hardly any in the function words. Is this the model cramming at the last minute to produce its loss-minimizing output? What was it doing in all those previous layers, then?
Anyway, I’m glad I went back and fixed up this experiment. It makes more sense and raises some fun questions. Are those large leaps typical for transformers? Is there a way to cut down the compute spent on function words, since they don’t seem to be affected by most layers of the model? Can we take some of that big learning happening in the final layer and make it happen earlier?