Audio Tokens Part 1: The Task
[7 Oct 2024 update: There’s a lot here in this worklog, dead ends and all. If you’re looking for the tl;dr/wrap-up, may I suggest the final (for now) post?]
[23 Sep 2024 update: This is a work log largely for my personal use: “Remember kids, the only difference between screwing around and science is writing it down”. There are mistakes and dead-ends and obvious oversights and serious facepalm moments. My plan is to leave in all the screwups, not try to retcon things later into a “I knew this from the beginning” sort of thing. As of today, things are still in progress, and there are ten of these entries. I expect there will be more.]
[27 Sep 2024 update: Found a major bug that makes pretty much all of the evaluation metrics up to today incorrect. So…ignore those. Part 14 picks up the story from the bug fix on.]
My current project: audio tokens!
The idea: Audio classification has typically been done by producing spectrograms (usually mel-scale amplitude plots) and then applying visual ML techniques, most often a CNN, to the resulting images to identify relevant audio features. Given the success of transformers in classifying and generating language sequences, more recently there have been attempts to use transformers for image and audio processing. In the audio world, the Audio Spectrogram Transformer (AST) is the best-known attempt. It is based on the Vision Transformer (ViT), which chops an image into a grid of patches, then uses the flattened patches as embeddings to train a transformer, skipping the tokenization process entirely. The AST does the same, but with audio spectrograms.
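For concreteness, here is roughly what that spectrogram-patch input looks like. This is just a sketch using librosa and NumPy, not the actual AST/ViT pipeline; the file name, frame parameters, and 16×16 patch size are placeholders.

```python
import librosa
import numpy as np

# Load a clip and compute a log-mel spectrogram (placeholder path and parameters).
y, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)            # shape: (n_mels, n_frames)

# ViT/AST-style input: cut the 2D spectrogram into square patches and flatten
# each patch into a 1D vector that stands in for a token embedding.
patch = 16
n_mels, n_frames = log_mel.shape
n_mels, n_frames = (n_mels // patch) * patch, (n_frames // patch) * patch  # drop ragged edges
grid = log_mel[:n_mels, :n_frames].reshape(n_mels // patch, patch, n_frames // patch, patch)
patches = grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)   # (num_patches, 256)
```

For this project, though, the interesting object is `log_mel.T`: one 64-dimensional mel vector per STFT time slice, in time order.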
But I was thinking…unlike an image, an audio clip is really a sequence, in the same way that a sentence in a text document is. By slicing it up into 2D squares, it seems like some valuable time information might be lost, even with positional encodings added on (they’re still 2D positions). So why not try to treat an audio clip as a sequence instead of an image? Then maybe you could train a transformer to classify audio without losing the essential time-based nature of sound?
This brings up a couple possible options.
- Skip tokenization. Like the AST does with its 2D patches, treat each individual STFT time-slice of an audio file as a pseudo-embedding and feed it into a transformer. That way you don’t have the AST’s slightly awkward conversion of a 2D patch into a 1D embedding. Treat the mel-spectrum vectors as pre-generated token embeddings.
- Tokenize the audio. Take all of the STFT time-slice vectors in all of your training data and feed them into a clustering algorithm to create a set of clusters of STFT vectors. Now you have a “vocabulary” of STFT clusters, and with a vocabulary, you can tokenize your audio. Number all of the cluster centroids, and for every STFT slice in a particular audio example in your training data, the nearest cluster centroid gives the next “token” in the audio sequence. Convert every audio file into a linear sequence of centroid IDs, and use those sequences to train a transformer (a rough sketch of this follows the list). That way your model can learn embeddings for each token instead of processing pseudo-embeddings, which are all going to be unique.
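Here’s a minimal sketch of what that tokenization step could look like, using scikit-learn’s MiniBatchKMeans purely as an illustration; the 1024-centroid vocabulary size and the function names are placeholders, not decisions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocab(all_frames: np.ndarray, vocab_size: int = 1024) -> MiniBatchKMeans:
    """Cluster every STFT/mel time-slice vector in the training set
    (stacked into one (num_frames, n_mels) array); the centroids
    become the token 'vocabulary'."""
    km = MiniBatchKMeans(n_clusters=vocab_size, batch_size=10_000)
    return km.fit(all_frames)

def tokenize(clip_frames: np.ndarray, vocab: MiniBatchKMeans) -> np.ndarray:
    """Convert one clip's (n_frames, n_mels) array into a 1D sequence of
    token IDs: each time slice maps to its nearest centroid."""
    return vocab.predict(clip_frames)
```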
To be honest, the second one seemed more interesting. It would be an attempt to directly adapt the power of text transformers to the audio space.
So I’ve started an experiment using AudioSet to test it. Tokenize the audio and train a transformer with those sequences of tokens and see if it can tell the difference between a dog barking and a door slamming.
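In code, the shape of the thing might look roughly like this. It’s a sketch, not the eventual implementation: the hyperparameters, the learned positional embeddings, the mean-pooling, and the class count are all placeholders.

```python
import torch
import torch.nn as nn

class AudioTokenClassifier(nn.Module):
    """Classify audio clips represented as sequences of cluster-ID tokens."""

    def __init__(self, vocab_size=1024, d_model=128, n_heads=4, n_layers=4,
                 n_classes=10, max_len=1024):
        super().__init__()
        # Option 2: a learned embedding per centroid ID. (Option 1 would instead
        # replace this table with nn.Linear(n_mels, d_model) applied to the raw
        # mel vectors themselves.)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)      # 1D positions only
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):                 # tokens: (batch, time) integer IDs
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)       # broadcast over batch
        x = self.encoder(x)                                # (batch, time, d_model)
        return self.head(x.mean(dim=1))                    # pool over time -> logits
```

The nice thing is that the two options differ in essentially one line: an embedding lookup for tokens versus a linear projection of the raw frame vectors.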
Details and a GitHub link coming.