Audio Tokens Part 9: What the bug?
I'm going to crack open the AudioSet unbalanced_train set here. Only 20,000 training examples may not be enough. The unbalanced train set has around 2 MILLION examples. (The final eval set in AudioSet is only 20,000 items, which seems like not nearly enough, but then again I'm no fancy big-city Google researcher.) Which means I may need to stop keeping everything in memory during preprocessing, now doesn't it?
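For what it's worth, the shape of the change is something like this — a minimal sketch of streaming preprocessing, where examples are tokenized one at a time and flushed to disk in shards instead of held in RAM. All the names here (`preprocess_streaming`, the manifest format, the JSONL shards) are hypothetical, not the project's actual code:

```python
import json

def preprocess_streaming(manifest, tokenize, shard_size=10_000, out_prefix="shard"):
    """Tokenize examples lazily and flush to disk in shards, so ~2M clips
    never have to fit in memory at once. `manifest` is any iterable of
    (clip_id, audio_path) pairs; `tokenize` maps a path to a token list."""
    buf, shard_idx = [], 0
    for clip_id, path in manifest:
        buf.append({"id": clip_id, "tokens": tokenize(path)})
        if len(buf) >= shard_size:
            _flush(buf, f"{out_prefix}_{shard_idx:05d}.jsonl")
            buf, shard_idx = [], shard_idx + 1
    if buf:  # don't drop the final partial shard
        _flush(buf, f"{out_prefix}_{shard_idx:05d}.jsonl")

def _flush(rows, path):
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

A Dataset can then read the shards back lazily, which is the part that actually keeps peak memory flat.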
So I rewrote a bunch of stuff yesterday to allow for a larger training set, consolidate the train/dev split code, and clean up the TokenizedSpecDataset class.
I located two interesting and related bugs:
- The num_classes parameter for the BERT model, instead of 631 (the number of AudioSet classes), was hard-coded to, um… 10. That explains why THAT model wasn't doing anything useful. (Why 10, you might ask? Because I had tried out this model with the UrbanSound8K dataset, which has 10 classes. facepalm)
- The TokenizedSpecDataset had num_classes set to 527, so it was only sending 527-wide label tensors to the model instead of 631-wide ones. Huh. Also problematic.
I moved all num_classes into the config, so that should fix THAT.
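The fix is basically a single source of truth: the model head and the dataset both read num_classes from one config object, so they can't silently disagree again. A rough sketch (ModelConfig and multi_hot are hypothetical names, not the project's real ones):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Single source of truth: the BERT head AND the dataset both read
    # this field, so a 10-vs-527-vs-631 mismatch can't happen silently.
    num_classes: int = 631  # AudioSet label count used in this project
    hidden_size: int = 768

def multi_hot(label_ids, num_classes):
    """Build a num_classes-wide multi-hot label vector for one clip."""
    vec = [0.0] * num_classes
    for i in label_ids:
        vec[i] = 1.0
    return vec
```

With the dataset calling `multi_hot(ids, cfg.num_classes)`, the label tensor width tracks the config automatically instead of a second hard-coded constant.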
So with those fixes and the new pipeline code, I ran the LSTM model again to see if there are any changes:
what.
Let’s compare that to the last run, pre-refactoring:
what.
A 0.41 peak val mAP. This is a one-layer LSTM with some convolutional pre-processing and no serious tuning.
If this were real, it would be at least in the ballpark of 0.45, which was SOTA a few years ago, and it might even get there, he said optimistically.
But you’re having the same thought I am: this is a bug. A big one. Time to scour the pipeline for data leakage.
[…time passes]
I can't find anything. The new split code is pretty simple. There's one thing that's odd: every instance of TokenizedSpecDataset receives a dict containing the assigned labels for both train and dev. So the train dataset instance holds the val labels too, but it never actually uses them for anything. I should probably fix that, but I don't think that's the issue.
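While I'm suspicious anyway, this is the kind of sanity check worth bolting onto the split code: a deterministic, hash-based split plus a hard assertion that no clip ID lands in both sets. Helper names here are hypothetical, a sketch of the idea rather than the actual pipeline:

```python
import hashlib

def split_bucket(clip_id, dev_fraction=0.05):
    """Deterministically assign a clip to 'train' or 'dev' by hashing its ID,
    so the split never depends on load order or a shared mutable dict."""
    h = int(hashlib.md5(clip_id.encode()).hexdigest(), 16)
    return "dev" if (h % 1000) < dev_fraction * 1000 else "train"

def assert_no_overlap(train_ids, dev_ids):
    """Cheap leakage check: no clip ID may appear in both splits."""
    leaked = set(train_ids) & set(dev_ids)
    assert not leaked, f"data leakage: {len(leaked)} ids in both splits"
```

It won't catch subtler leaks (duplicate audio under different IDs, labels smuggled in through features), but it rules out the dumbest failure mode in one line.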