What's Going On at Layer 5?
Continuing on from the last post, you might have had the same question I did:
What the heck, layer 5?
Let’s go back to the relative ranking plot. For each layer, this chart shows how many original vocabulary embeddings are closer (by cosine distance) to a token’s current encoding than the token’s own original embedding is.
So if a token’s current encoding is closest to its own original embedding, its rank is 1. If 100 other vocabulary embeddings are closer, its rank is 101.
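To make the metric concrete, here’s a minimal sketch of how those ranks could be computed with Hugging Face Transformers and PyTorch. The model checkpoint, example sentence, and tie handling are my assumptions, and I’m treating the “original embedding” as the static vocabulary embedding matrix; the plot above may have been produced slightly differently.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # any example sentence
inputs = tokenizer(text, return_tensors="pt")
token_ids = inputs["input_ids"][0]                      # (seq_len,)

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[0] is the embedding-layer output, hidden_states[i] is layer i,
# so "layer 5" in the plot is assumed to be hidden_states[5].
hidden_states = outputs.hidden_states

# Static input embeddings for the whole vocabulary, (vocab_size, dim)
vocab_emb = model.get_input_embeddings().weight.detach()
vocab_emb_n = torch.nn.functional.normalize(vocab_emb, dim=-1)

def ranks_at_layer(layer_idx):
    """For each position, count how many vocabulary embeddings are closer
    (by cosine) to the current encoding than the token's own original
    embedding, plus one — so rank 1 means the own embedding is closest."""
    enc = torch.nn.functional.normalize(hidden_states[layer_idx][0], dim=-1)
    sims = enc @ vocab_emb_n.T                     # cosine similarity, (seq_len, vocab)
    own_sim = sims[torch.arange(len(token_ids)), token_ids].unsqueeze(1)
    return (sims > own_sim).sum(dim=1) + 1

for layer in range(len(hidden_states)):
    print(layer, ranks_at_layer(layer).tolist())
```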
At layer 5, some of the tokens take a huge leap in rank. And it’s worth noting just how big that leap is: RoBERTa has a vocabulary of 50,265 tokens, and at this layer, some tokens have nearly 50,000 other vocabulary embeddings closer to their current encoding than their own original embedding.
One token—just a period (“.”)—has a rank of 50,099.
That’s...not subtle.
Then in layer 6 they snap back to where they were. Odd. Why just some, and not all?
Here’s the same plot with the jumping tokens highlighted:
Well now. There’s a pattern here. These are all function words without much semantic value (plus </s>, the end-of-sequence marker). The content words actually don’t do this jump.
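If you want to pick out the jumping tokens programmatically rather than eyeball them, one simple way (continuing the sketch above) is to flag positions whose rank spikes at layer 5 and falls back at layer 6. The threshold here is arbitrary and just for illustration.

```python
# Flag positions whose rank explodes at layer 5 relative to layer 4
# and then drops again at layer 6.
ranks_4, ranks_5, ranks_6 = (ranks_at_layer(l) for l in (4, 5, 6))
jumpers = (ranks_5 > 100 * ranks_4) & (ranks_6 < ranks_5)

tokens = tokenizer.convert_ids_to_tokens(token_ids.tolist())
for pos in torch.nonzero(jumpers).flatten().tolist():
    print(tokens[pos], ranks_4[pos].item(), ranks_5[pos].item(), ranks_6[pos].item())
```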
This isn’t a quirk of this particular token sequence; it always happens with this model. Is this a funny trait of this particular RoBERTa checkpoint, or some more general behavior?
I don’t have an answer. But that jump sure is something.
(Also worth noting: the final layer shows the inverse pattern—function words stay put, and content words leap.)