============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Audio Tokens Part 15: Onward or Not?


Carrying on from the last post, I tried several different combinations of STFT, LSTM, and clustering parameters. Without going into detail about all the things I tried, we can just skip to the results:

[figure: training runs for every parameter combination, all similar learning curves, all bad]

That means either there's a serious bug in the training, or this particular tokenization approach simply doesn't work.

I could keep going with some other fundamental changes to the preprocessing and the model. I really do think it’s still worth looking into treating audio as a sequence instead of as an image.

  • Maybe just feeding raw STFT data into the model as “embeddings” would work better. That was Option 1 from the initial writeup.
  • Maybe initializing the current embeddings to the token cluster centroids would get training into a more interesting part of the loss curve (or at least speed up the bad results).
  • Maybe I should investigate what's happening to the gradients, although with a one-layer LSTM I wouldn't expect them to be out of control.
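The first option above would skip tokenization entirely and feed magnitude-spectrum frames straight in as the input sequence. A minimal sketch of what that preprocessing could look like (window size and hop length here are arbitrary placeholders, not the values from the actual experiments):

```python
import numpy as np

def stft_frames(signal, n_fft=256, hop=128):
    """Slice a 1-D signal into overlapping windows and take the
    magnitude spectrum of each: one vector per time step."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        np.abs(np.fft.rfft(window * signal[i * hop : i * hop + n_fft]))
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, n_fft // 2 + 1)

# One second of a synthetic 440 Hz tone at 16 kHz, just for shape checking.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
emb = stft_frames(sig)  # each row is one "embedding" vector
# These rows would go to the LSTM directly, with no token lookup in between.
```

Each frame then plays the role an embedding lookup would normally play, so the model sees a continuous sequence instead of discrete token IDs.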
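The centroid-initialization idea from the second bullet could look something like this: run k-means over the STFT frames and copy the centroids into the embedding matrix instead of using a random init. All sizes below are hypothetical, and the k-means is a bare-bones stand-in for whatever clustering the pipeline actually uses:

```python
import numpy as np

def kmeans_centroids(X, k, iters=20, seed=0):
    """Plain k-means, just enough to produce centroids for initialization."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        labels = np.argmin(
            ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# Hypothetical setup: 1000 STFT frames of dimension 129, a 32-token vocabulary.
frames = np.random.default_rng(1).normal(size=(1000, 129))
embedding_matrix = kmeans_centroids(frames, k=32)  # shape (32, 129)
# This matrix would replace the random initialization of the model's
# embedding layer, so each token starts at its cluster centroid.
```

Since the tokens were defined by clustering in the first place, starting the embeddings at the centroids at least puts them somewhere meaningful, even if it only speeds up reaching the same bad plateau.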

Or maybe I need to take a break and try a different project for a little while.