Audio Tokens Part 5: Back to Basics
OK, so things aren’t working as well as I’d hoped out of the box. Time to try a new box. Or a new metaphor.
I’m going to try a super-simple network just to see if I can get something to fit.
Enter SimpleTokenClassifier. Take the tokens, create embeddings, average pool them, shove them through a linear layer and see if anything useful comes out. With 50 tokens:
haha. It’s only a little worse than the BERT model. Is there even data in these tokens? Is this thing on? I probably need to spend some time getting some model to overtrain, just to make sure there’s something to this whole tokenization thing.
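For the record, “super-simple” really does mean super-simple. Here’s a minimal sketch of the model in PyTorch (the embedding width and class count are placeholders I’m assuming, not necessarily the real config):

```python
import torch
import torch.nn as nn

class SimpleTokenClassifier(nn.Module):
    """Embed the tokens, average-pool over the sequence, one linear layer."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, num_classes: int = 527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embedding(tokens)   # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)            # average pool over the sequence
        return self.classifier(x)    # (batch, num_classes) logits

# e.g. a batch of 8 sequences of 512 token ids drawn from a 50-token vocabulary
model = SimpleTokenClassifier(vocab_size=50)
logits = model(torch.randint(0, 50, (8, 512)))
```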
Same model, 500 tokens instead of 50:
Well, it’s overtraining, but mAP is maxing out between 0.2 and 0.25. So there’s still no real evidence that the token sequences contain enough information to learn anything useful. It could also be that this intentionally simple model is just too small for this task.
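(Aside: by mAP I mean plain multi-label mean average precision, averaged over classes. Roughly this, sketched with scikit-learn — I’m assuming per-class sigmoid scores, which may not match the exact evaluation setup:)

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    # y_true:   (n_samples, n_classes) binary labels
    # y_scores: (n_samples, n_classes) predicted scores, e.g. sigmoid outputs
    aps = [
        average_precision_score(y_true[:, c], y_scores[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].any()  # skip classes with no positives in this split
    ]
    return float(np.mean(aps))
```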
Now for some entertainment. Let’s try it with a vocabulary of…um…two tokens. The power of the binary compels you! Everything else stays the same.
That was fun.
Back to work. Let’s bump to 1000 tokens and see what happens:
OK now. That’s a little more interesting. It overtrains to a higher end point, while val mAP stays almost exactly the same. Let’s see where this train goes. 5000 tokens:
Out of curiosity, what happens if I plug the BERT model back in here, with 5000 tokens and the same sequences? To start with, instead of taking 25 minutes for 100 epochs, it will take about 13 hours. (Spare H100 lying around? Call me!) Luckily, in case things look bad early, I have a foolproof early-stopping algorithm I call “Control-C”.
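Which, in code, is about as sophisticated as it sounds: catch the interrupt and keep whatever the model has learned so far. A sketch — run_one_epoch here is a hypothetical stand-in for the real training loop:

```python
import torch

def train_with_ctrl_c(model, run_one_epoch, num_epochs, ckpt_path="checkpoint.pt"):
    # run_one_epoch(epoch) is a hypothetical callable that trains and
    # evaluates one epoch; Ctrl-C at any point just ends the run early.
    epoch = 0
    try:
        for epoch in range(num_epochs):
            run_one_epoch(epoch)
    except KeyboardInterrupt:
        print(f"Interrupted at epoch {epoch} -- stopping early.")
    finally:
        torch.save(model.state_dict(), ckpt_path)
```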
Back to something simpler then.