One Double RoBERTa, with a Side of Strange
attention-encodings[The full demo tool, with the new model selection dropdown, is available to play with here.]
[Previous posts here and here]
I added a model switcher to the attention-encoding demo: now you can toggle between roberta-base and roberta-large to compare how token representations evolve layer by layer.
The differences:
- roberta-base: 12 layers, hidden size 768, 12 attention heads
- roberta-large: 24 layers, hidden size 1024, 16 attention heads
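If you want to poke at the same thing outside the demo, here's a minimal sketch (not the demo's actual code) of pulling per-layer hidden states out of both models with Hugging Face transformers; the example sentence is just a placeholder:

```python
import torch
from transformers import AutoTokenizer, AutoModel

def layer_states(model_name: str, text: str) -> torch.Tensor:
    """Return hidden states for every layer: (n_layers + 1, seq_len, hidden)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding layer; the rest are the encoder layers.
    return torch.stack(outputs.hidden_states).squeeze(1)

# roberta-base -> 13 x seq_len x 768, roberta-large -> 25 x seq_len x 1024
base_states = layer_states("roberta-base", "Time flies like an arrow.")
large_states = layer_states("roberta-large", "Time flies like an arrow.")
```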
So let’s try some of the previous roberta-base plots in roberta-large and see if large follows the same patterns.
Here are the cosine similarities for roberta-base:
And here are the similarities for roberta-large:
In roberta-large, the token representations seem to drift steadily away from their original embeddings, without the short “return” we saw in roberta-base.
It almost looks like the larger model commits to its divergence earlier, and never looks back.
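If you want to reproduce this kind of curve yourself, here's a sketch of one way to compute it, assuming the plots show cosine similarity between each layer's token vectors and the layer-0 embeddings (reusing the `layer_states` helper from the sketch above; plotting omitted):

```python
import torch
import torch.nn.functional as F

def drift_from_embedding(states: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of every layer's token vectors to the layer-0 embeddings.

    states: (n_layers + 1, seq_len, hidden) -> (n_layers + 1, seq_len)
    """
    original = states[0].unsqueeze(0)                  # (1, seq_len, hidden)
    return F.cosine_similarity(states, original, dim=-1)

base_drift = drift_from_embedding(base_states)     # 13 x seq_len
large_drift = drift_from_embedding(large_states)   # 25 x seq_len
```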
Here are the token rankings relative to the original embeddings for roberta-base:
And here are the rankings for roberta-large:
Well now. That’s a little more interesting, isn’t it?
In roberta-base, token rankings relative to their original embeddings remain fairly stable (except for that wild jump at layer 5).
But in roberta-large, two things stand out:
- There’s a smeared-out version of the roberta-base layer 5 discontinuity, spread over several of the middle layers
- Then, in the final third of the model, almost every token drifts far from its origin in relative terms
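Since the exact recipe for “rankings” matters, here's one reasonable reading: for each token at each layer, rank all of the sentence's layer-0 embeddings by cosine similarity to that token's current hidden state, and record where the token's own original embedding lands. A sketch under that assumption (building on the helpers above):

```python
import torch
import torch.nn.functional as F

def rank_of_own_embedding(states: torch.Tensor) -> torch.Tensor:
    """For each layer and token, the rank (1 = closest) of that token's own
    layer-0 embedding among all layer-0 embeddings in the sentence, ordered
    by cosine similarity to the token's current hidden state.

    states: (n_layers + 1, seq_len, hidden) -> (n_layers + 1, seq_len)
    """
    original = F.normalize(states[0], dim=-1)       # (seq, hidden)
    hidden = F.normalize(states, dim=-1)            # (layers+1, seq, hidden)
    sims = hidden @ original.T                      # (layers+1, seq, seq)
    order = sims.argsort(dim=-1, descending=True)   # most-similar first
    own = torch.arange(states.shape[1]).view(1, -1, 1)
    # Rank is the position at which each token's own index appears.
    return (order == own).float().argmax(dim=-1) + 1

base_ranks = rank_of_own_embedding(base_states)     # 13 x seq_len
large_ranks = rank_of_own_embedding(large_states)   # 25 x seq_len
```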
It appears that something more complex is going on in this larger model; it isn’t just scaling up the same behavior.
My assumption was that the same patterns would show up in the larger model, but instead everything got blurrier and less structured.
Scripting a post-script for the post
(Are polysemous jokes a form of dad joke? Asking for a friend.)
One thing, however, doesn’t change between the models.
You’d expect polysemous words such as flies (verb vs. noun) or like (preposition vs. verb) to drift apart as attention layers disambiguate their meanings in context.
But they don’t. Despite extremely different meanings, they stay surprisingly close together all the way through both models.
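One way to check this directly is to track the cosine similarity between the two occurrences of the same polysemous word across layers. A sketch, assuming the classic “Time flies like an arrow; fruit flies like a banana” sentence (which matches the flies/like pairing above) and the `layer_states` helper from earlier:

```python
import torch.nn.functional as F
from transformers import AutoTokenizer

sentence = "Time flies like an arrow; fruit flies like a banana."

# Locate the two " flies" tokens (assumes " flies" tokenizes to a single
# BPE id in the roberta-base vocabulary -- adjust if it doesn't).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
flies_id = tokenizer(" flies", add_special_tokens=False)["input_ids"][0]
positions = (ids == flies_id).nonzero().flatten()

# Cosine similarity between verb-"flies" and noun-"flies" at every layer.
states = layer_states("roberta-base", sentence)   # from the first sketch
sims = F.cosine_similarity(states[:, positions[0]], states[:, positions[1]], dim=-1)
print(sims)   # 13 values; if they stay high, the two senses never really separate
```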
Maybe attention isn’t providing context in quite the way we think.
I might like to ponder a thing like that in another post.