============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Audio Tokens Part 18: The Wrap-Up

audio-tokens

Updated 2025-04-22 with a brief intro for context.

This project explored whether short-time audio features (STFT slices) could be clustered into symbolic “tokens” and modeled using sequence architectures like BERT. It didn’t work out the way I’d hoped, but I figured out a lot about where this kind of approach breaks down, and what might be worth trying next. (Also, I spent a few days chasing phantom performance gains thanks to a classic extend() vs append() bug.)

Hoping that more data would fix things, as it often does, I gathered up a set of 200,000 Audioset training files, or about 10% of the full set. Many (long) tests were run over the weekend, both with the tokenized spectrograms and the raw spectrograms being fed into models as embeddings, and here is a good summary of the results:

[results: all the peak val mAPs under 0.05 for many tests]

Which says to me that it might be time to (perhaps temporarily) move on to a different project.

But let’s wrap this up with some style:

Project Wrap-up

Original Concept

The concept developed here was applying sequence model techniques to audio data. Most audio processing ML architectures use 2D slices of spectrograms or 2D CNNs. I explored whether converting spectrogram STFT slices to a limited “vocabulary” using K-means clustering would allow text-style sequence processing models to be used in audio classification. In the same way image processing techniques have been successfully applied to audio, I hoped to show that recent sequence-based NLP techniques could be applied to audio as well.

Project Environment

Architecture

Dataset

Audioset was used as the dataset, because it seemed to be the only one of sufficient size to train a transformer. Most other labeled audio datasets are too small to train very complex models. Since Google has never made raw Audioset data available (only preprocessed features), I originally tried to download it from individual YouTube videos, but after getting blocked by YouTube, I turned to an independently collected Hugging Face dataset, which proved invaluable.

Preprocessing

  • A subset of Audioset was generated from either the Audioset balanced_train set (20,000 examples) or from a combination of the balanced_train and unbalanced_train datasets (~2MM samples). That subset was split into training and validation sets.
  • Each audio sample was first converted to a mel-spectrogram (side note: I remain unconvinced that the Mel scale is the right approach for sounds not intended for human consumption, given its origins), converted to dB, then self-normalized. torchaudio’s MelSpectrogram and AmplitudeToDB transforms were used. (A rough sketch of this step follows the list.)
  • The spectrogram was split into its individual time-slice STFT vectors, each capturing the frequency content of the audio over a particular (very small) time window.
  • For some iterations, a 1D CNN with {3,5} kernels of size {3,5} was applied to each STFT vector, and the results were concatenated into new time-slice vectors. This was an attempt to find useful features at the per-STFT level.
  • The STFT time-slice vectors from the training set were clustered using K-means (k = {50, 500, 5000}), with each cluster centroid becoming one archetypal numbered “token” for later tokenization. FAISS was used for the clustering.
  • The training and validation time-slice STFT vectors were then each converted into a “token” by finding the nearest centroid by cosine distance, and the tokens for each audio sample were concatenated into a token sequence representing that sample. (The clustering and tokenization steps are also sketched below.)
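
For concreteness, here’s a minimal sketch of the spectrogram step, assuming 16 kHz mono audio; the specific n_fft, hop_length, and n_mels values are illustrative placeholders rather than the exact settings used in every run:

    import torch
    import torchaudio

    # Illustrative parameters -- the real runs swept several of these.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, hop_length=512, n_mels=128
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    def audio_to_time_slices(waveform: torch.Tensor) -> torch.Tensor:
        """waveform: (1, num_samples) mono audio -> (num_frames, n_mels) time-slice vectors."""
        spec = to_db(mel(waveform))                        # (1, n_mels, num_frames)
        spec = (spec - spec.mean()) / (spec.std() + 1e-8)  # per-sample normalization
        return spec.squeeze(0).transpose(0, 1)             # one vector per STFT time slice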

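And a rough sketch of the clustering and tokenization steps, assuming all of the training-set time slices have been stacked into one big float32 array; the k and niter values here are placeholders:

    import faiss
    import numpy as np

    def build_token_vocab(train_slices: np.ndarray, k: int = 500) -> faiss.IndexFlatIP:
        """Cluster training time slices into k centroids; return a cosine-similarity index."""
        x = np.ascontiguousarray(train_slices, dtype=np.float32)
        kmeans = faiss.Kmeans(x.shape[1], k, niter=20, spherical=True, verbose=True)
        kmeans.train(x)
        centroids = kmeans.centroids.copy()
        faiss.normalize_L2(centroids)            # on normalized vectors, inner product == cosine
        index = faiss.IndexFlatIP(x.shape[1])
        index.add(centroids)
        return index

    def tokenize(slices: np.ndarray, index: faiss.IndexFlatIP) -> np.ndarray:
        """Map each time-slice vector of one sample to the id of its nearest centroid."""
        x = np.ascontiguousarray(slices, dtype=np.float32)
        faiss.normalize_L2(x)
        _, ids = index.search(x, 1)
        return ids.ravel()                       # the token sequence for this audio sample
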
Model Architectures

Now that we have one linear token sequence for each audio file, we treat them in the same way we might handle text token sequences, and send them through sequence models:

  • BERT: An untrained BERT model was used. Training a BERT model was the original intent of the project, but after realizing that there was a question about whether or not the tokens carried any useful information, I largely moved to simpler models for experimenting.
  • LSTM: To speed up training and exploration of tokenization parameters, a simple 1-layer LSTM was used to see if the tokens contained any useful information.
  • Linear: A very simple embedding->average-pooling->linear model was also used for token testing (sketched just below).
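
As a concrete example, the linear token model was roughly this shape; the vocabulary size, embedding dimension, and class count below are placeholders:

    import torch
    import torch.nn as nn

    class TokenPoolClassifier(nn.Module):
        """Embed tokens, average-pool over time, classify: a floor test for whether the tokens carry signal."""
        def __init__(self, vocab_size: int = 500, embed_dim: int = 128, num_classes: int = 543):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, seq_len) integer token ids
            x = self.embed(tokens)     # (batch, seq_len, embed_dim)
            x = x.mean(dim=1)          # average pooling over the sequence
            return self.classifier(x)  # multi-label logits, trained with BCEWithLogitsLoss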

For baseline testing and sanity checking, the tokens weren’t used, but raw STFTs or spectrograms were fed into some other models as pre-computed embeddings:

  • BERT: Untrained BERT model, with raw STFT vectors dropped in at the embedding layer instead of computing embeddings from the tokens. (A rough sketch of this variant follows the list.)
  • CNN: A 4-layer basic 2D CNN model was used on the computed spectrograms to see how a more “traditional” audio processing model would work.
  • Linear: The raw STFT vectors were fed into a three-layer linear model, to see if the raw STFT vectors would perform better than the tokens while still preserving the sequential properties of the audio.
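
For the raw-STFT BERT baseline, the Hugging Face transformers API makes this straightforward, since BertModel accepts precomputed vectors via inputs_embeds. Here’s a minimal sketch; the config values are placeholders, and the linear projection to the hidden size is one way to handle the dimension matching, not necessarily how the project did it:

    import torch
    import torch.nn as nn
    from transformers import BertConfig, BertModel

    class RawSTFTBert(nn.Module):
        """Feed precomputed spectrogram time slices into an untrained BERT via inputs_embeds."""
        def __init__(self, stft_dim: int = 128, num_classes: int = 543):
            super().__init__()
            config = BertConfig(hidden_size=256, num_hidden_layers=4,
                                num_attention_heads=4, intermediate_size=512)
            self.project = nn.Linear(stft_dim, config.hidden_size)  # match BERT's hidden size
            self.bert = BertModel(config)                           # randomly initialized, not pretrained
            self.classifier = nn.Linear(config.hidden_size, num_classes)

        def forward(self, slices: torch.Tensor) -> torch.Tensor:
            # slices: (batch, seq_len, stft_dim); seq_len must stay under max_position_embeddings (512)
            out = self.bert(inputs_embeds=self.project(slices))
            return self.classifier(out.pooler_output)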

Results

Metric

Since mAP is the typical Audioset metric, peak val mAP values were used for preliminary model assessment.
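
For reference, mAP here means macro-averaged average precision: compute AP per class, then average over classes. A minimal way to get that number (using scikit-learn, which may or may not be what the project’s eval loop actually used):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
        """y_true: (num_samples, num_classes) binary labels, y_score: predicted scores.
        Assumes every class has at least one positive example in the evaluation split."""
        return average_precision_score(y_true, y_score, average="macro")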

Bug

A bug in the training/evaluation loop, introduced at the beginning of the project, inflated peak val mAP values to near-state-of-the-art levels for a couple of weeks. Python’s extend() was used where append() should have been. An excellent reminder to always check every tensor shape everywhere all the time early and often and forever.
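
For anyone who wants to feel the pain secondhand, the failure mode looks roughly like this (a reconstructed illustration, not the actual project code):

    collected = []
    for batch_output in ([[0.1, 0.9], [0.8, 0.2]], [[0.3, 0.7]]):  # per-batch lists of per-sample scores
        collected.extend(batch_output)   # flattens one level: [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
        # collected.append(batch_output) # keeps batches nested: [[[0.1, 0.9], [0.8, 0.2]], [[0.3, 0.7]]]

    # Neither version crashes on its own; the trouble starts when downstream code assumes the
    # other shape and the accumulated predictions quietly stop lining up with the labels.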

Outcomes

Post-bugfix, the tokenization and raw sequence models performed slightly worse than the more traditional models, but it’s hard to draw firm conclusions about their general applicability because peak mAP never went above 0.05 for any of the models. Audioset proved to be beyond these models. Getting most of them to overfit the training data was doable, but validation results were never really adequate.

Conclusion

Audioset is quite the beast of a dataset, but results shouldn’t have been quite so bad. I’m not sure if the results were a consequence of Audioset’s general difficulty (543 classes is a lot for mAP, since the overall score is averaged across every class) or if the models needed to be much more complex to have a chance.

I still have an (unproven) sense that using sequence-based models on audio in a truly sequence-based way (unlike AST, which takes 2D patches and converts them into pseudo-token-embeddings) is worth exploring further.

Possible future directions:

  • Limit Audioset classification task to a more manageable size by using a smaller number of classes.
  • Spend more time dealing with the large class-imbalance Audioset has, via sampling or weighting. I didn’t go down that path at all.
  • Try more complex models, although BERT is pretty complex. Getting most of the models to overfit wasn’t a problem, but attempts at regularization just slowed the overfitting down without improving validation results.
  • Look more deeply at which classes the models were identifying correctly and see what properties those sounds have that might be helping the model along.
  • Add attention mechanisms to the LSTM models.
  • Investigate gradient behavior and try to fix any exploding or vanishing gradients.
  • Try implementing known successful models (like AST) within the current project as a sanity check.

Final Note

I’ve been working on nothing but this for the past few weeks, so I’ll be taking a break from it for a little while. Have no fear, I have some more interesting ideas for projects that should be showing up here soon.