Audio Tokens Part 8: Convoluted
[Update 19 Sep: Funny thing, this actually didn’t work out as planned. I forgot to turn tokenization on for these examples, so the change below looks to be a random blip of some sort, not the result of adding convolution. I’m going to try to really add it later. Also, the comment below about accidentally running the same data over and over again was sort of true.]
Before switching to any major, focus-shifting architecture changes, I figured I’d try something simpler first. From the last update, I’m taking the first option: run a basic 1D convolution on the STFT time-slices to try to make them a bit more frequency-invariant. Eight kernels, kernel size 3, and just concatenating the filtered outputs together to create the new input for the K-Means clustering. The previous input, recall, was just the entire STFT time-slice vector as-is.
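For concreteness, here’s roughly what that looks like (a minimal sketch, assuming PyTorch and scikit-learn; the shapes, the padding, and the untrained random kernels are illustrative choices, not necessarily what the real pipeline does):

```python
# Sketch: convolve each STFT time-slice along the frequency axis with 8
# size-3 kernels, flatten the 8 filtered copies into one vector, and feed
# those vectors to K-Means instead of the raw slices.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

n_freq_bins = 257  # e.g. an STFT with n_fft=512 (illustrative)
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

def slice_features(stft_slices: torch.Tensor) -> torch.Tensor:
    """stft_slices: (num_slices, n_freq_bins) magnitude spectra.
    Returns (num_slices, 8 * n_freq_bins)."""
    with torch.no_grad():
        x = stft_slices.unsqueeze(1)   # (N, 1, n_freq_bins)
        y = conv(x)                    # (N, 8, n_freq_bins)
    return y.flatten(start_dim=1)      # (N, 8 * n_freq_bins)

# Cluster the convolved slices instead of the raw ones.
feats = slice_features(torch.rand(10_000, n_freq_bins))
kmeans = KMeans(n_clusters=50).fit(feats.numpy())
```

Note that concatenating eight filtered copies makes the clustering input eight times wider than the raw slice, which is relevant to the memory complaint below.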
Let’s give it a go: [Time passes… I learn that keeping all of the convolutions in memory takes about 38GB. I have 48GB. Scalability!]
It’s hard to tell, but we’ve bumped the max val mAP from 0.12 in the previous set to 0.145! Problem solved, case closed.
Well, not really. But it does show that there’s something good happening with the convolutions. Of course, I ran this one with 50 tokens, which seems like even fewer than it did before, although oddly it has given the best results up to this point. Let’s bump that to 500 tokens and see what we can see:
[…time passes while I refactor the config to put all the individual configs in one class…having num_tokens defined in one place for cluster creation and another place for model training is not ideal]
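The refactor ends up being something like this, with every knob in one dataclass so num_tokens is defined exactly once (field names here are illustrative, not the real config):

```python
# One config object instead of several; num_tokens is shared by
# cluster creation and model training. Field names are made up.
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    num_tokens: int = 500       # used by both K-Means and the LSTM vocabulary
    conv_kernels: int = 8
    conv_kernel_size: int = 3
    lstm_dropout: float = 0.5

cfg = ExperimentConfig(num_tokens=50)   # e.g. the 50-token runs
```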
Well, it sure does memorize the training set real good. Have I noted that I’ve been running all these LSTMs with a dropout of 0.5? So this is some serious overtraining action to overcome THAT.
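For reference, the model is roughly this shape (a sketch: the embedding size, hidden size, layer count, and last-step pooling are guesses; only the token vocabulary and the 0.5 dropout come from what I’ve actually described):

```python
import torch.nn as nn

class TokenLSTM(nn.Module):
    def __init__(self, num_tokens=50, embed_dim=64, hidden=128, num_classes=527):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, embed_dim)
        # nn.LSTM's dropout only applies between layers, hence num_layers=2.
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            dropout=0.5, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)   # 527 AudioSet classes

    def forward(self, token_ids):                    # (batch, seq_len) of token ids
        x = self.embed(token_ids)                    # (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)                        # (batch, seq_len, hidden)
        return self.head(out[:, -1])                 # last-step logits, scored by mAP
```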
Why do I get the most generalizable results (barely) with 50 tokens instead of 500? I guess 50 tokens acts as a sort of regularization, in that it simply gives the model less information to memorize.
These LSTM runs are pretty quick. Let’s try Conv1D kernels of size 5, everything else the same.
Hrm. This is the point where I start to wonder if I messed up my data pipeline somewhere and am just processing the same data over and over again no matter what parameters I change. I mean, seriously, that outlier may be in a different epoch, but it looks the same.
New thought: Maybe the problem isn’t in the model. Maybe I need more training data.
I may need to crack open the AudioSet unbalanced_train set here. Only 20,000 training examples may not be enough. The unbalanced train set has around 2 MILLION examples. (The eval set in AudioSet is only 20,000 items, which seems like not nearly enough, but then again I’m no fancy big-city Google researcher.) Which means I may need to stop keeping everything in memory during preprocessing, now doesn’t it?
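If I go that way, the preprocessing loop probably turns into something like this, writing each clip’s features straight to disk instead of accumulating them in one giant in-memory array (a sketch; compute_stft_slices is a hypothetical stand-in for the actual STFT step, and the paths are made up):

```python
import numpy as np
from pathlib import Path

def preprocess_to_disk(wav_paths, out_dir="features"):
    """Handle one clip at a time; nothing accumulates in RAM."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, path in enumerate(wav_paths):
        slices = compute_stft_slices(path)   # hypothetical helper: (N, n_freq_bins) tensor
        feats = slice_features(slices)       # the Conv1d step sketched earlier
        np.save(out / f"{i:07d}.npy", feats.numpy())
        # 2 million clips becomes a disk-space problem, not a 38GB-of-RAM problem.
```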