Audio Tokens Part 6: Slightly Less Basic
In our last episode, I managed to get a dead-simple model to overfit on the sequences by cranking up the number of tokens in the vocabulary. This likely means one of two things (or something in between):
- The audio tokens carry some useful information that can generalize.
- The audio tokens carry no useful information, and the overfitting just reflects the model memorizing the embedding averages once many more embeddings are involved.
Let’s bet on the first one for now. Keep the spectrogram and token generation as-is, and try a slightly more complex model. Say hello to SimpleLSTMClassifier.
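Here's roughly what that looks like, a minimal sketch rather than the exact implementation (the mean-pooling over time and the num_classes of 50 are my assumptions; the real head size depends on the dataset):

```python
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    """Embed token IDs, run a (bi)LSTM over them, classify from a pooled output."""

    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256,
                 num_layers=1, num_classes=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Bidirectional doubles the feature size going into the head.
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer token IDs
        x = self.embedding(tokens)
        out, _ = self.lstm(x)      # (batch, seq_len, hidden_dim * 2)
        pooled = out.mean(dim=1)   # mean-pool over time (an assumption)
        return self.fc(pooled)     # raw logits, one per class
```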
Running a bidirectional LSTM with embed_dim=128, hidden_dim=256, num_layers=1, and 5000 tokens (also bumping batch size up to 128 from 8 because this is a tiny model):
Interesting. The loss charts have outliers removed, and the differences on the y-axis aren't very large, but the losses are still dropping after 100 epochs. The mAPs still aren't great, but for a single LSTM layer, it's not so bad.
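For reference, mAP here just means averaging per-class average precision. A sketch of how I'd compute it, assuming multi-label targets and scikit-learn (not necessarily the eval code I'm actually running):

```python
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    # y_true:  (num_clips, num_classes) binary label matrix
    # y_score: (num_clips, num_classes) sigmoid outputs from the model
    # Macro-averaging over classes is the usual "mAP" for audio tagging.
    return average_precision_score(y_true, y_score, average="macro")
```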
Let’s up the learning rate. It’s at 5e-5, so let’s crank it up to 1e-3.
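In code, that's a one-line change. I'm assuming a plain Adam setup here for illustration; any optimizer with an lr knob works the same way:

```python
import torch

model = SimpleLSTMClassifier()  # the sketch from above
# Same training loop, just a hotter learning rate (previously lr=5e-5).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```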
Now that’s what I call overfitting. Training mAP of 1.0, val peaking at 0.11. At this point I could try some regularization, I suppose, just to see if I can get the val mAP up at all. Or I could try changing the token generation. Let’s go with option 1: regularization. Just to see.
First, let’s throw some dropout into the model after the LSTM layer. Let’s not kid around: 0.5. 200 epochs this time.
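Concretely, that's one new line in the sketch from earlier (same caveats apply: the pooling and num_classes are my assumptions):

```python
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256,
                 num_layers=1, num_classes=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p=0.5)  # the new bit: heavy dropout after the LSTM
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, tokens):
        out, _ = self.lstm(self.embedding(tokens))
        return self.fc(self.dropout(out.mean(dim=1)))
```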
It dropped the train mAP! Yay! It also dropped the val mAP! Boo!
Time for a break. It may be worth switching over to changing how the tokens are generated next; I'm unconvinced these tokens carry any useful information.