============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Audio Tokens Part 2: The Architecture

audio-tokens

Here’s how things are set up now, in preparation for the first real training test:

(I’ve skipped over the “is this thing basically working” phase here, since it wasn’t that exciting. Take it as given that each component seems to be working as expected.)

Training Set

I’m using the AudioSet “bal_train” set, since it’s only around 20,000 files. AudioSet files are 10-second clips from YouTube videos.

Validation Set

I’m using a very sophisticated technique in data_splitter.py to pull a validation set out of the bal_train set. If the associated YouTube ID for an audio clip starts with one of the characters [ABCDEF], it’s a validation example. Assuming YouTube IDs are random-ish, this isn’t totally terrible.
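In case the rule isn't clear, it amounts to something like this (a minimal sketch; the actual logic lives in data_splitter.py, and the names here are mine, not necessarily the script's):

    VALIDATION_PREFIXES = set("ABCDEF")

    def split_ids(youtube_ids):
        # Any clip whose YouTube ID starts with A-F goes to validation;
        # everything else stays in the training set.
        train, validation = [], []
        for ytid in youtube_ids:
            (validation if ytid[0] in VALIDATION_PREFIXES else train).append(ytid)
        return train, validation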

Spectrogram Generation (SpectrogramProcessor)

This is a fairly straightforward mel-spectrogram generator. Take an audio file, make a picture of its amplitude per frequency bin per time period.

Input:

  • A list of YouTube IDs of audio files to create spectrograms from. Generated by AudiosetMetadataProcessor.
  • The location of a top-level directory of AudioSet audio files. Default is “/media/davery/audioset” because that’s where my AudioSet HDD is mounted.
  • The “source set” of the dataset to use. For AudioSet the values are “bal_train”, “unbal_train”, and “eval”. (Currently only “bal_train” works, because it’s hardcoded into AudiosetMetadataProcessor.)

Output:

  • A pickle file containing a list of {filename, spectrogram} items. Default location is processed/specs.pkl.

Hyperparameters:

  • sample_rate: 22050. Files that aren’t at this rate get converted to it. Why 22050? It’s common, and why not?
  • normalization: True. All of the spectrogram values are individually linearly normalized to be between 0 and 1.
  • n_fft: 1024. A typical starting point for STFT window size.
  • hop_length: 512. Overlap the STFT windows by half their size.
  • n_mels: 64. Now this gets more interesting. 64 is a common starting point, but I can see this one needing to be adjusted. (I have conceptual issues with using the mel scale for sounds not generated by or for humans, given where it comes from, but we’ll stick with it for now.)
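Putting those hyperparameters together, the core of the spectrogram step looks roughly like this. This is a sketch using librosa, which may or may not be what SpectrogramProcessor actually calls; the function name is made up.

    import librosa
    import numpy as np

    SAMPLE_RATE = 22050
    N_FFT = 1024
    HOP_LENGTH = 512
    N_MELS = 64

    def make_spectrogram(path):
        # librosa resamples to SAMPLE_RATE on load if the file differs
        audio, _ = librosa.load(path, sr=SAMPLE_RATE)
        spec = librosa.feature.melspectrogram(
            y=audio, sr=SAMPLE_RATE, n_fft=N_FFT,
            hop_length=HOP_LENGTH, n_mels=N_MELS)
        # normalize each spectrogram individually into [0, 1]
        spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
        return spec  # shape: (N_MELS, n_time_slices)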

Generating the clusters/vocabulary (ClusterCreator)

This is where the audio token vocabulary is generated. Take each time-slice vector of every spectrogram in the training set, throw them all into a K-Means clustering algorithm (currently using FAISS for this), and come up with a list of centroids. Each of your centroids/clusters is now a token in the audio vocabulary of the training corpus.

Input:

  • Spectrograms of the training set. By default they’re assumed to be in a pickle file in “processed/”.

Output:

  • A numpy array of n_clusters cluster centroid locations. By default saved to “output/centroids.npy”

Hyperparameters:

  • n_clusters: 50. This is a big one. This is the number of clusters to generate, or how large we want our corpus vocabulary to be. Too few and the clusters won’t be distinct enough to convey any useful information; too many and they won’t be dense enough. 50 is almost certainly too few, but we’ll see.
  • niter: 20. How many iterations for FAISS to perform in the clustering operation.
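For reference, the clustering step boils down to something like this, using the FAISS Kmeans wrapper. The actual ClusterCreator may differ in detail; the input list of spectrograms and the save path are assumptions based on the defaults above.

    import faiss
    import numpy as np

    N_CLUSTERS = 50
    NITER = 20

    def build_vocabulary(spectrograms):
        # Each spectrogram is (n_mels, n_time_slices); every time-slice
        # column becomes one training vector for k-means.
        frames = np.concatenate([s.T for s in spectrograms]).astype(np.float32)
        kmeans = faiss.Kmeans(frames.shape[1], N_CLUSTERS, niter=NITER, verbose=True)
        kmeans.train(frames)
        return kmeans.centroids  # (N_CLUSTERS, n_mels) array of centroid locations

    # np.save("output/centroids.npy", build_vocabulary(spectrograms))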

Tokenizing the training set (SpecTokenizer)

Now, for every file in the training and validation sets, we classify each STFT time-slice vector into one of the n_clusters clusters based on its distance from the centroids, converting each spectrogram into a sequence of audio tokens.

Input:

  • Spectrograms of the training and validation sets
  • the cluster centroid locations just created

Output:

  • Tokenized versions of every example audio file. Default location is “tokenized/[train|validation]”

Hyperparameters:

  • None.
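The nearest-centroid lookup itself is simple. Here’s a rough sketch using a flat FAISS index; the function name is mine, not necessarily SpecTokenizer’s.

    import faiss
    import numpy as np

    def tokenize(spectrogram, centroids):
        # Assign each time slice the ID of its closest centroid,
        # turning the spectrogram into a sequence of audio tokens.
        index = faiss.IndexFlatL2(centroids.shape[1])
        index.add(centroids.astype(np.float32))
        _, ids = index.search(spectrogram.T.astype(np.float32), 1)
        return ids.ravel()  # e.g. array([12, 12, 47, 3, ...])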

Training a transformer (ModelBuilder)

We take a standard uninitialized BERT model and add a head that sends the [CLS] token vector through a single linear layer, projecting it into 631 possible outputs, one for each AudioSet label.

Input:

  • Tokenized audio files

Output:

  • Magic. Or maybe just a transformer that can classify AudioSet files.
  • Train and validation mean average precision scores (mAP), since that’s what AudioSet models are typically evaluated with. Also, F1 scores and Hamming loss for fun.

Hyperparameters:

  • vocab_size: This better be the same as n_clusters in ClusterCreator, or things will be weird.
  • num_layers: 12. The depth of a standard BERT model.
  • epochs: 20
  • batch_size: 16. Because that’s what my RTX 3070 can handle in 8GB of VRAM with a standard 12-layer BERT model.
  • learning_rate: 5e-5. This should probably not be a constant, but we have to start somewhere.
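Roughly what that model looks like, assuming a Hugging Face BertModel (the real ModelBuilder may differ in detail; the class name here is made up):

    import torch
    from transformers import BertConfig, BertModel

    VOCAB_SIZE = 50    # must match n_clusters from ClusterCreator
    NUM_LABELS = 631   # one output per AudioSet label

    class AudioTokenClassifier(torch.nn.Module):
        def __init__(self):
            super().__init__()
            config = BertConfig(vocab_size=VOCAB_SIZE, num_hidden_layers=12)
            self.bert = BertModel(config)  # randomly initialized, no pretrained weights
            self.head = torch.nn.Linear(config.hidden_size, NUM_LABELS)

        def forward(self, input_ids, attention_mask=None):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            cls_vector = out.last_hidden_state[:, 0]  # the [CLS] position
            return self.head(cls_vector)  # multi-label logits; pair with BCEWithLogitsLoss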