Audio Tokens Part 13: The Big Bug
Batch clustering loops fixed, so large training sets are doable now.
Let’s run with ~80,000 samples from the training set, with ~8,000 held out as validation:
We know these numbers aren’t real, right? 0.484 mAP on AudioSet is SOTA as of 2021, and this model is WAY too simple.
The question is, how do I prove it and find the real numbers?
Time to verify some of these results by hand: run a trained model on some examples that aren’t in the training or validation sets.
Introducing ManualTester, a Frankensteinian mashup of copies and pastes from all the other files! (note to self: refactor everything to make this sort of thing easier)
ManualTest for ytid “i50igBzmCdc”. Truth labels first, predictions afterward:
2024-09-25 13:44:00,739 - INFO - Labels: ['Guitar', 'Music', 'Plucked string instrument', 'Soundtrack music']
torch.Size([1, 1723])
2024-09-25 13:44:00,864 - INFO - Rank Value Category
2024-09-25 13:44:00,864 - INFO - 1 0.818 Music *
2024-09-25 13:44:00,864 - INFO - 2 0.355 Guitar *
2024-09-25 13:44:00,865 - INFO - 3 0.329 Plucked string instrument *
2024-09-25 13:44:00,865 - INFO - 4 0.312 Musical instrument
2024-09-25 13:44:00,865 - INFO - 5 0.306 Electric guitar
2024-09-25 13:44:00,865 - INFO - 6 0.264 Heavy metal
2024-09-25 13:44:00,865 - INFO - 7 0.187 Rock music
2024-09-25 13:44:00,865 - INFO - 8 0.181 Speech
2024-09-25 13:44:00,865 - INFO - 9 0.130 Strum
2024-09-25 13:44:00,865 - INFO - 10 0.116 Progressive rock
Not too shabby, at least for this one. A few other tests show reasonable-if-not-quite-this-good results.
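For context, the ManualTester idea is conceptually tiny: load a trained checkpoint, push one held-out clip’s tokens through it, and print the top-k per-class scores next to the ground-truth labels. A rough sketch of that shape (function and argument names here are illustrative, not the actual ManualTester code, and it assumes the model emits raw logits for a multi-label task):

```python
import torch

def manual_test(model, tokens, truth_labels, class_names, top_k=10):
    """Run one held-out clip through a trained classifier and print its
    top-k predicted categories alongside the ground-truth labels."""
    model.eval()
    with torch.no_grad():
        logits = model(tokens.unsqueeze(0))       # add a batch dim: (1, seq_len)
        probs = torch.sigmoid(logits).squeeze(0)  # multi-label, so sigmoid rather than softmax
    print(f"Labels: {truth_labels}")
    values, indices = probs.topk(top_k)
    for rank, (score, idx) in enumerate(zip(values, indices), start=1):
        name = class_names[idx.item()]
        hit = " *" if name in truth_labels else ""
        print(f"{rank:>2}  {score.item():.3f}  {name}{hit}")
```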
I also realized one more thing: I’ve been truncating all input sequences to 512 tokens in the datasets. That seemed like enough at one point, but with the smaller n_fft the sequences are now 1723 tokens long, so the model is only processing the first three seconds or so of each audio clip.
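To make the truncation concrete, the failure mode is just a hard cap in the dataset’s `__getitem__`; a hypothetical sketch (the real dataset code is more involved):

```python
import torch
from torch.utils.data import Dataset

class TokenDataset(Dataset):
    """Illustrative only: a fixed max_len quietly discards everything past
    the cap. With sequences now ~1723 tokens for a 10-second clip, a
    512-token cap keeps only about the first three seconds."""
    def __init__(self, token_seqs, labels, max_len=512):
        self.token_seqs = token_seqs
        self.labels = labels
        self.max_len = max_len

    def __len__(self):
        return len(self.token_seqs)

    def __getitem__(self, idx):
        tokens = self.token_seqs[idx][: self.max_len]  # the silent truncation
        return torch.as_tensor(tokens), torch.as_tensor(self.labels[idx])
```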
Let’s re-run the model with the limits off (commit 6f27c97):
[…and CUDA runs out of memory. Silly 8GB VRAM card. Drop the batch size from 128 to 64.]
The new peak mAP is 0.517. Which has got to be even more wrong than before.
The one-layer LSTM model was just meant to test that the tokens contained useful information. Just to make sure there’s nothing really wonky with that model, let’s go back to the dirt-simple linear model. It should do something interesting, but if it works just as well as the LSTM I will be very suspicious of the whole enterprise. Keeping everything else the same… (commit 27f8845):
The previous LSTM run is plotted as well. The linear classifier is worse than the LSTM, but still way too high. Either there’s still some bug I’m not seeing, or these tokens are magic.
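For reference, “dirt-simple linear model” here can be as little as an embedding lookup, a mean-pool over time, and a single linear head; a sketch of that idea (not necessarily the exact architecture in the repo):

```python
import torch
import torch.nn as nn

class LinearTokenClassifier(nn.Module):
    """Embed discrete audio tokens, mean-pool over the time axis, and map
    straight to multi-label logits with one linear layer."""
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):          # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens)          # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)               # collapse time -> (batch, embed_dim)
        return self.head(x)             # (batch, num_classes) logits
```

The point of a baseline this crude is that it has no way to use token order at all, so any sequence information the LSTM exploits should show up as a gap between the two curves.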
I need to break this data pipeline down piece by piece and find out what’s going on.
[…time passes]
And I found the issue. The labels and prediction tensors were being flattened in ModelTrainer, so instead of an (a, b) tensor for each, there was one (a*b,) tensor. None of the metrics complained; they just assumed it was a binary classification task and reported weird numbers.
extend() vs append(), kids. Don’t mix ’em up. Fix in commit 5e4aa2f.
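A hedged reconstruction of what that mix-up looks like in a typical evaluation loop (not the exact ModelTrainer code): `extend()` iterates over the batch dimension and adds each per-sample row separately, so the final `torch.cat()` glues them end-to-end into a flat (num_samples * num_classes,) tensor, whereas `append()` keeps each (batch, num_classes) tensor whole and `cat()` yields the proper 2-D array.

```python
import torch

def gather_predictions(model, loader, buggy=False):
    """Collect sigmoid scores across batches for metric computation."""
    model.eval()
    all_preds = []
    with torch.no_grad():
        for tokens, _labels in loader:
            preds = torch.sigmoid(model(tokens))    # (batch, num_classes)
            if buggy:
                all_preds.extend(preds.cpu())       # splits into (num_classes,) rows
            else:
                all_preds.append(preds.cpu())       # keeps the 2-D batch intact
    # buggy=True  -> torch.cat gives (num_samples * num_classes,) : looks binary
    # buggy=False -> torch.cat gives (num_samples, num_classes)   : actual multi-label
    return torch.cat(all_preds)
```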
A bummer, but an unsurprising one. The tokens aren’t magic after all.
A quick test run with the fix. The new peak val mAP? 0.018. That’s certainly an adjustment. The tokens may in fact be cursed.
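For completeness, a shape check on the metric inputs is the kind of assertion that would have caught this earlier: multi-label mAP is a macro average of per-class AP over (num_samples, num_classes) arrays, and if both arrays arrive flattened to 1-D, the very same call quietly computes a single binary-classification AP instead. A sketch assuming scikit-learn is doing the metric work (which may not match the repo’s metrics stack):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged AP over classes; insists on 2-D inputs so a silently
    flattened (num_samples * num_classes,) pair fails loudly instead of
    being scored as binary classification."""
    assert y_true.ndim == 2 and y_true.shape == y_score.shape, y_true.shape
    return average_precision_score(y_true, y_score, average="macro")
```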