Audio Tokens Part 17: All Sane, So Far
Here’s my checklist from the last post:
- Look at a few more generated spectrograms.
  - Do they look sane? Continue.
  - Do they look insane? Fix the spectrograms!

They look sane. Moving on.
- Try the spectrograms with a standard vanilla CNN of the type that is known to work well on spectrograms.
  - Do the results improve significantly? End this round of the project and move on–it doesn’t work as-is.
  - Do the results not improve significantly? Keep going.

The results are roughly the same as with all the other models; the peak validation mAP I can get is 0.03-ish. Continuing.
- Look at the AudioSet string-label-to-numeric-label conversion.
  - Good? Continue.
  - Wrong? Fix it!

Looks good. Continuing.
- Look at the assignment of labels to ytids.
  - Good? Continue.
  - Wrong? Fix it!

Also good.
- Check the outputs of the model. See if there’s anything off-by-one in there messing with things.
  - Good? Continue.
  - Wrong? Fix it!

The model seems to work OK on frequent classes, poorly on infrequent ones.
- If everything above checks out, maybe I’m just not using enough data. Maybe the balanced training set just isn’t enough with all these categories. Shovel 25-50% of the AudioSet training data into the pipeline.
  - Does it work better? Maybe that’s the issue?
  - Does it have the same bad results? Throw MacBook out of window, wrap this up, and try a different project.

We’ve made it to the final step. Looking at the per-class mAP on the small balanced_train dataset, a simple model isn’t so bad on frequent classes (“Speech,” etc.), where it gets 0.26 or so, but most of the classes seem to be a 0.0-something. Average out 500 0.0-somethings and a few 0.2s and you get a 0.0-something.
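A quick back-of-the-envelope check of that averaging. The 0.26 and 0.03 figures are from above; the 5-vs-495 class split is an illustrative assumption, not the actual AudioSet class counts:

```python
import numpy as np

# Hypothetical per-class average precisions: a handful of frequent
# classes do OK (~0.26), the long tail sits near 0.03.
aps = np.concatenate([np.full(5, 0.26), np.full(495, 0.03)])

# Macro mAP weights every class equally, so the tail dominates.
macro_map = aps.mean()
print(round(macro_map, 4))  # → 0.0323
```

A few decent classes barely move the needle against hundreds of near-zero ones, which is exactly the 0.03-ish overall mAP pattern.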
Given that all of the models are doing poorly on low-frequency classes, which might make sense given 500+ classes in 20,000 files, one approach is to shovel a whole bunch more training data into the machine. Luckily there are about two million more samples to choose from.
Unfortunately, this brings up a bottleneck–my spectrogram generation code is serial and will take a long time to process a million files. So I need to parallelize the spectrogram generation.
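Here’s a sketch of one way to parallelize it with a process pool. This is not my actual pipeline: `make_spectrogram` is a plain-numpy short-time FFT and `worker` generates random audio, both stand-ins so the example is self-contained. The real version would load each AudioSet clip and compute its mel spectrogram instead.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def make_spectrogram(audio, n_fft=1024, hop=512):
    """Minimal magnitude spectrogram: windowed frames -> real FFT."""
    frames = [audio[i:i + n_fft]
              for i in range(0, len(audio) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))

def worker(seed):
    # Stand-in for loading one clip from disk; one second at 16 kHz.
    rng = np.random.default_rng(seed)
    audio = rng.standard_normal(16000)
    return make_spectrogram(audio)

if __name__ == "__main__":
    # Each worker process handles one clip; results come back in order.
    with ProcessPoolExecutor() as pool:
        specs = list(pool.map(worker, range(8)))
    print(len(specs), specs[0].shape)
```

The catch, as noted below, is the output side: having every worker write its own tiny file is where this stalls.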
I got it partly done, but got bottlenecked on writing each spectrogram to an individual file, which is how the rest of the code expects them. I may need to change how the spectrograms are stored and retrieved. I’m thinking of either stuffing them into SQLite or using HDF5, which I haven’t had the chance to use yet. Either way, I’m going to throw some more data at this, and if things still don’t improve, I’ll move on, perhaps temporarily, to another project.
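For the SQLite option, a minimal sketch of what storage and retrieval could look like. The table layout and function names are hypothetical, and the dtype is pinned to float32 for simplicity; arrays go in as raw bytes with their shape stored alongside so they can be rebuilt on read:

```python
import sqlite3
import numpy as np

def save_spectrogram(conn, ytid, spec):
    # Store the array as a BLOB, keyed by YouTube id (hypothetical schema).
    conn.execute(
        "INSERT OR REPLACE INTO spectrograms VALUES (?, ?, ?, ?)",
        (ytid, spec.shape[0], spec.shape[1],
         spec.astype(np.float32).tobytes()),
    )

def load_spectrogram(conn, ytid):
    rows, cols, blob = conn.execute(
        "SELECT rows, cols, data FROM spectrograms WHERE ytid = ?",
        (ytid,),
    ).fetchone()
    return np.frombuffer(blob, dtype=np.float32).reshape(rows, cols)

conn = sqlite3.connect(":memory:")  # a real run would use a file on disk
conn.execute(
    "CREATE TABLE spectrograms "
    "(ytid TEXT PRIMARY KEY, rows INT, cols INT, data BLOB)"
)

# Round-trip check with a dummy 128-mel-band spectrogram.
spec = np.random.rand(128, 431).astype(np.float32)
save_spectrogram(conn, "abc123", spec)
assert np.array_equal(load_spectrogram(conn, "abc123"), spec)
```

One nice property of a single database file over a million tiny spectrogram files: writes from the parallel workers can be funneled through one writer instead of hammering the filesystem.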