Audio Tokens Part 12: A Minor Bug Squash
I decided that rather than find out why things were so much worse before that one commit, I’d look for data leakage in the newer, better-performing code.
And I haven’t found any. The train/dev split is straightforward, using Python indexing of a shuffled list of YouTube IDs. The centroids are computed only from training data. And the model training keeps the train and val data separate in two different Dataset classes.
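For reference, a leakage-free split of that shape can be sketched like this (function and variable names are my own, not from the repo): shuffle the list of video IDs once with a fixed seed, then slice, so every ID lands in exactly one split.

```python
import random

def split_ids(youtube_ids, dev_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then slice, so each
    video ID lands in exactly one of train/dev."""
    ids = list(youtube_ids)
    random.Random(seed).shuffle(ids)
    n_dev = int(len(ids) * dev_frac)
    return ids[n_dev:], ids[:n_dev]  # train, dev

train, dev = split_ids([f"vid{i}" for i in range(100)])
assert not set(train) & set(dev)  # no ID appears in both splits
```

Splitting at the video-ID level (rather than at the slice level) is what prevents slices of the same recording from straddling the train/dev boundary.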
But I did find a fun bug just after that commit which dropped the val mAP from 0.43 to 0.30. The tokenization batching was writing its output in the middle of a loop, so only the last batch survived. Which meant that training data was about 20% of what it should have been.
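A minimal sketch of that failure mode, with a stand-in `encode` function (the real code presumably wrote tokens to disk rather than building a list, but the shape of the bug is the same):

```python
def encode(x):
    # stand-in for the real per-slice tokenizer
    return x * 2

def tokenize_buggy(batches):
    for batch in batches:
        out = [encode(x) for x in batch]  # BUG: rebinds `out` every iteration
    return out  # only the last batch survives

def tokenize_fixed(batches):
    out = []
    for batch in batches:
        out.extend(encode(x) for x in batch)  # accumulate across batches
    return out

batches = [[1, 2], [3, 4], [5]]
# tokenize_buggy(batches)  → [10]           (last batch only)
# tokenize_fixed(batches)  → [2, 4, 6, 8, 10]
```

With a final short batch like this, the buggy version keeps only a small fraction of the data, which matches the "training data was about 20% of what it should have been" symptom.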
Having noted that, I may just have to assume that this new mAP is…valid? It still seems too high. It would mean that the clustering/audio-vocabulary creation is doing a heckuva lot of work here, since I’m just feeding its sequences into a one-layer, completely untuned LSTM.
I may come back to this at some point, since I’m not sure what bug got squashed here to change the results so much, and it makes me suspicious.
In the meantime I’ve realized that my batching code for ClusterCreator and SpecTokenizer is inside out: it’s batching slices within a file instead of batching across files, so it doesn’t help much. Need to fix that.
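A toy sketch of the difference, treating each file as a list of slices (function names are illustrative, and a real fix would also need to track which slices came from which file):

```python
def slices_inside_out(files, batch_size):
    """Current shape: batches slices *within* each file, so a short
    file yields a small batch no matter what batch_size is."""
    for f in files:
        for i in range(0, len(f), batch_size):
            yield f[i:i + batch_size]

def slices_batched_across_files(files, batch_size):
    """Flatten slices from all files into one stream, then batch,
    so batches stay full regardless of individual file lengths."""
    stream = [s for f in files for s in f]
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

files = [[1, 2, 3], [4], [5, 6]]
# slices_inside_out(files, 4)           → [1,2,3], [4], [5,6]
# slices_batched_across_files(files, 4) → [1,2,3,4], [5,6]
```

The per-file version never produces a batch larger than the shortest file, which is why it doesn’t help much; batching across the flattened stream keeps batches at the requested size.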