What's Going On at Layer 5?
Continuing on from the last post, you might have had the same question I did:
What the heck, layer 5?
Let’s go back to the relative ranking plot. For each layer, this chart shows how many original vocabulary embeddings are closer (by cosine distance) to a token’s current encoding than the token’s own original embedding is.
So if a token’s current encoding is closest to its own original embedding, its rank is 1. If 100 other vocabulary embeddings are closer, its rank is 101.
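To make the metric concrete, here’s a minimal sketch of how those ranks could be computed with Hugging Face Transformers and PyTorch. The model checkpoint, example sentence, and tie handling are my assumptions, and I’m treating the “original embedding” as the static vocabulary embedding matrix; the plot above may have been produced slightly differently.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # any example sentence
inputs = tokenizer(text, return_tensors="pt")
token_ids = inputs["input_ids"][0]                      # (seq_len,)

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[0] is the embedding-layer output, hidden_states[i] is layer i,
# so "layer 5" in the plot is assumed to be hidden_states[5].
hidden_states = outputs.hidden_states

# Static input embeddings for the whole vocabulary, (vocab_size, dim)
vocab_emb = model.get_input_embeddings().weight.detach()
vocab_emb_n = torch.nn.functional.normalize(vocab_emb, dim=-1)

def ranks_at_layer(layer_idx):
    """For each position, count how many vocabulary embeddings are closer
    (by cosine) to the current encoding than the token's own original
    embedding, plus one — so rank 1 means the own embedding is closest."""
    enc = torch.nn.functional.normalize(hidden_states[layer_idx][0], dim=-1)
    sims = enc @ vocab_emb_n.T                     # cosine similarity, (seq_len, vocab)
    own_sim = sims[torch.arange(len(token_ids)), token_ids].unsqueeze(1)
    return (sims > own_sim).sum(dim=1) + 1

for layer in range(len(hidden_states)):
    print(layer, ranks_at_layer(layer).tolist())
```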
At layer 5, some of the tokens take a huge leap in rank. And it’s worth noting just how big that leap is: RoBERTa has a vocabulary of 50,265 tokens, and at this layer, some tokens have nearly 50,000 other vocabulary embeddings closer to their current encoding than their own original embedding.
One token—just a period (“.”)—has a rank of 50,099.
That’s...not subtle.
Then in layer 6 they snap back to where they were. Odd. Why just some, and not all?
Here’s the same plot with the jumping tokens highlighted:
Well now. There’s a pattern here. These are all function words without much semantic value (plus </s>, the end-of-sequence marker). The content words actually don’t do this jump.
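If you want to pick out the jumping tokens programmatically rather than eyeball them, one simple way (continuing the sketch above) is to flag positions whose rank spikes at layer 5 and falls back at layer 6. The threshold here is arbitrary and just for illustration.

```python
# Flag positions whose rank explodes at layer 5 relative to layer 4
# and then drops again at layer 6.
ranks_4, ranks_5, ranks_6 = (ranks_at_layer(l) for l in (4, 5, 6))
jumpers = (ranks_5 > 100 * ranks_4) & (ranks_6 < ranks_5)

tokens = tokenizer.convert_ids_to_tokens(token_ids.tolist())
for pos in torch.nonzero(jumpers).flatten().tolist():
    print(tokens[pos], ranks_4[pos].item(), ranks_5[pos].item(), ranks_6[pos].item())
```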
This isn’t a quirk of this particular token sequence; it always happens with this model. Is this a funny trait of this particular RoBERTa checkpoint, or some more general behavior?
I don’t have an answer. But that jump sure is something.
(Also worth noting: the final layer shows the inverse pattern—function words stay put, and content words leap.)