
Audio Tokens Part 7: Reconsidering


Given the current (lack of) performance of this model, I’m spending some time rethinking the basic ideas. Those of you following along at home may have been shouting at the screen since Part 1, trying to get me to see one large flaw in my tokenizing setup.

Every STFT time slice is a vector of energies at particular frequencies at a moment in time. That would probably work fine if the labels were about, for example, recognizing individual dogs. But our labels here are things like “dog barking” vs. “pneumatic drill”, not “cute little Fluffy barking” vs. “Butch barking right before he eats someone”. Some dogs have high-frequency barks and some have low-frequency barks, and STFT vectors from those two dog barks would be nowhere near each other in the vector space.
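
To make that concrete, here’s a tiny sanity check (pure NumPy, with synthetic sine tones standing in for a low bark and a high one): two STFT slices dominated by different frequencies come out nearly orthogonal, even though both are “a bark”.

    import numpy as np

    sr = 16000
    t = np.arange(2048) / sr
    low = np.sin(2 * np.pi * 200 * t)    # stand-in for a low-pitched bark
    high = np.sin(2 * np.pi * 1200 * t)  # stand-in for a high-pitched bark

    def stft_slice(x):
        # One STFT time slice: windowed FFT magnitudes for a single frame
        return np.abs(np.fft.rfft(x * np.hanning(len(x))))

    a, b = stft_slice(low), stft_slice(high)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"cosine similarity: {cos:.4f}")  # ~0: almost no shared energy bins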

And in the realm of multiclass classification, “dog barking at a running pneumatic drill” slices aren’t going to be super close to “dog barking” or “pneumatic drill” slices.

It looks like I need to make some changes. I could:

  • Run a 1D convolution over the slices to make them more frequency-invariant before clustering (first sketch after this list).
  • Create an additional sequence of tokens based on the sound amplitude over mel bands instead of over time slices (maybe also with a 1D convolution). That would provide some time-based information as a separate token sequence I could concatenate onto the frequency token sequence before sending it into the model. That way frequency and time are both accounted for, although they’d live in separate stretches of the sequence. I’ll admit, this one is a little odd (second sketch below).
  • Create 2D time-frequency patches of the spectrogram and send those into the model as embeddings (third sketch below). I can call it AST, the Audio Spectrogram Transformer. (It’s almost like they were on to something!)
  • Create 2D time-frequency patches of the spectrogram and tokenize/cluster those before sending them into the model (also in the third sketch). I’m not sure that would do anything better than AST except lose information. But it might be worth trying at some point.
  • Stop doing quite so much old-school feature engineering in the cluster creation, and start learning the tokens/clusters in the model (last sketch below). It would be a major downgrade in potential interpretability, but it might work better.
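
Here’s roughly what I mean by the first option, as a PyTorch sketch (sizes are placeholders, and the conv weights are untrained here, just to show the shapes): convolve each slice along the frequency axis, then max-pool over frequency, so the same spectral pattern maps to roughly the same feature vector wherever it sits on the axis. Those pooled features are what would get clustered, instead of the raw slices.

    import torch
    import torch.nn as nn

    n_bins = 513
    slices = torch.randn(32, 1, n_bins)   # batch of 32 STFT slices, 1 channel

    # Small filters slide along the frequency axis; max-pooling over frequency
    # discards *where* a pattern occurred, keeping only *whether* it occurred.
    conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=9, padding=4)
    feats = torch.relu(conv(slices))        # (32, 16, 513)
    feats = feats.max(dim=2).values         # (32, 16), roughly shift-invariant
    # cluster `feats` rather than `slices`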
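The second option is basically tokenizing the spectrogram’s rows instead of (or in addition to) its columns. A toy sketch, with made-up sizes and random data standing in for a real mel spectrogram:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    mel_spec = rng.random((64, 400))   # (n_mel_bands, n_frames), fake data

    # Columns -> one token per time slice (what I've been doing so far);
    # rows -> one token per band's amplitude envelope over time.
    time_tokens = KMeans(n_clusters=8, n_init=10).fit_predict(mel_spec.T)
    band_tokens = KMeans(n_clusters=8, n_init=10).fit_predict(mel_spec)

    # Concatenate the two streams into a single sequence for the model
    sequence = np.concatenate([time_tokens, band_tokens])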
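For the two patch-based options, the usual move (and, as far as I understand it, what AST itself does) is a strided Conv2d that carves the spectrogram into 16×16 time-frequency patches and projects each one to an embedding. The fourth option would instead flatten the raw patches and cluster them into a discrete vocabulary. All shapes here are made up:

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    spec = torch.randn(1, 1, 128, 1024)   # (batch, channel, mel_bins, frames)

    # Option 3: AST-style patch embeddings, one 768-dim vector per patch
    patchify = nn.Conv2d(1, 768, kernel_size=16, stride=16)
    embeddings = patchify(spec).flatten(2).transpose(1, 2)   # (1, 512, 768)

    # Option 4: cluster the flattened raw patches into discrete token ids
    patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 1, 8, 64, 16, 16)
    patches = patches.reshape(-1, 16 * 16).numpy()       # (512, 256)
    token_ids = KMeans(n_clusters=32, n_init=10).fit_predict(patches)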
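And for the last option, the standard way to learn tokens inside the model is VQ-VAE-style vector quantization: keep a codebook, snap each encoder output to its nearest code, and pass gradients straight through the snap. A bare-bones sketch, with made-up dimensions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        def __init__(self, n_codes=512, dim=64):
            super().__init__()
            self.codebook = nn.Embedding(n_codes, dim)

        def forward(self, z):                             # z: (batch, dim)
            dists = torch.cdist(z, self.codebook.weight)  # (batch, n_codes)
            ids = dists.argmin(dim=1)                     # nearest code per input
            q = self.codebook(ids)
            commit = F.mse_loss(z, q.detach())            # pulls encoder toward codes
            q = z + (q - z).detach()                      # straight-through gradients
            return q, ids, commit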