============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Audio Tokens Part 5: Back to Basics

audio-tokens

OK, so things aren’t working as well as I’d hoped out of the box. Time to try a new box. Or a new metaphor.

I’m going to try a super-simple network just to see if I can get something to fit.

Enter SimpleTokenClassifier. Take the tokens, create embeddings, average pool them, shove them through a linear layer and see if anything useful comes out. With 50 tokens:

val mAP maxing out at 0.125

haha. It’s only a little worse than the BERT model. Is there even data in these tokens? Is this thing on? I probably need to spend some time getting some model to overtrain, just to make sure there’s something to this whole tokenization thing.
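
Roughly speaking, the model looks like the sketch below; the embedding width and label count are just placeholders, not the real configuration.

    import torch
    import torch.nn as nn

    class SimpleTokenClassifier(nn.Module):
        # embed the token ids, average-pool over the sequence, linear head
        def __init__(self, vocab_size, embed_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, tokens):       # tokens: (batch, seq_len) token ids
            x = self.embed(tokens)       # (batch, seq_len, embed_dim)
            x = x.mean(dim=1)            # average pool over the sequence
            return self.head(x)          # (batch, num_classes) logits

    # placeholder sizes: 50-token vocabulary, arbitrary width and label count
    model = SimpleTokenClassifier(vocab_size=50, embed_dim=128, num_classes=10)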

Same model, 500 tokens instead of 50:

train mAP over 0.2, val mAP peaking at 0.14 then dropping

Well, it’s overtraining, but train mAP is maxing out between 0.2 and 0.25. So there’s still no evidence that there’s enough information in the token sequences to learn anything. It could also be that this intentionally simple model is just too small for this task.
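
(Quick aside on the metric: mAP here is the usual macro-averaged average precision over classes for multi-label tagging. With scikit-learn and some made-up labels and scores, computing it looks like this:)

    import numpy as np
    from sklearn.metrics import average_precision_score

    # made-up multi-hot labels and model scores, just to show the call
    y_true = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
    y_score = np.array([[0.9, 0.2, 0.4],
                        [0.1, 0.7, 0.3],
                        [0.6, 0.8, 0.2]])

    val_map = average_precision_score(y_true, y_score, average="macro")
    print(f"mAP: {val_map:.3f}")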

Now for some entertainment. Let’s try it with a vocabulary of…um…two tokens. The power of the binary compels you! Everything else stays the same.

flatlined mAP

That was fun.

Back to work. Let’s bump to 1000 tokens and see what happens:

train mAP maxing around 0.32, val mAP same as 500 tokens

OK now. That’s a little more interesting. Overtraining at a higher end point, while val mAP is almost exactly the same. Let’s see where this train goes. 5000 tokens:

train mAP at 0.55 and still rising, val mAP same as before

Well, well. That is some quality overfitting right there. Given that this model isn’t even using positional data and doesn’t have much capacity, this means either that it’s just memorizing the appearance of individual tokens, or that there’s actual information in the tokens and we need to figure out whether any of it generalizes (or some of both).

Out of curiosity, what happens if I plug the BERT model back in here, with 5000 tokens and the same sequences? To start with, instead of taking 25 minutes for 100 epochs, it will take about 13 hours. (Spare H100 lying around? Call me!) Luckily, in case things look bad early, I have a foolproof early-stopping algorithm I call “Control-C”.
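
(“Control-C” early stopping really is just a try/except around the training loop; a sketch, with a sleep standing in for the real epoch:)

    import time

    try:
        for epoch in range(100):
            time.sleep(1)                 # stand-in for one real training epoch
            print(f"epoch {epoch} done")
    except KeyboardInterrupt:
        print("stopping early; keeping the last saved checkpoint")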

mAP flatline

Back to something simpler then.