============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Audio Tokens Part 6: Slightly Less Basic

audio-tokens

In our last episode, I managed to get a dead-simple model to overfit on the sequences once I cranked up the number of tokens in the vocabulary. This probably means one of two things (or something in between):

  • The audio tokens carry some useful information that generalizes.
  • The audio tokens carry no useful information, and the overfitting just means the model can memorize the embedding averages once there are many more embeddings involved.

Let’s bet on the first one for now. Keep the spectrogram and token generation as-is, and try a slightly more complex model. Say hello to SimpleLSTMClassifier.
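The post stops before the code, but a minimal sketch of the idea might look like this (the hidden size, single LSTM layer, and 527-class head matching AudioSet's label count are my guesses, not the actual SimpleLSTMClassifier):

import torch
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    """Hypothetical sketch: embed token IDs, run them through an LSTM,
    and classify from the final hidden state. All dimensions here are
    illustrative assumptions, not the real model's."""

    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, num_classes=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):          # tokens: (batch, seq_len) int64
        x = self.embedding(tokens)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])         # multi-label logits: (batch, num_classes)

Unlike the mean-pooled model from Part 5 (below), the LSTM at least gets to see token order.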

Read more...

Audio Tokens Part 5: Back to Basics

audio-tokens

OK, so things aren’t working as well as I’d hoped out of the box. Time to try a new box. Or a new metaphor.

I’m going to try a super-simple network just to see if I can get something to fit.

Enter SimpleTokenClassifier. Take the tokens, create embeddings, average-pool them, shove them through a linear layer, and see if anything useful comes out. With 50 tokens:

val mAP maxing out at 0.125
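For reference, that embed/average/linear pipeline is small enough to sketch in full (dimensions and the 527-class AudioSet head are my guesses, not the actual SimpleTokenClassifier):

import torch
import torch.nn as nn

class SimpleTokenClassifier(nn.Module):
    """Hypothetical sketch of the model described above: embed token IDs,
    mean-pool across the sequence, apply one linear layer. Dimensions are
    illustrative assumptions."""

    def __init__(self, vocab_size=50, embed_dim=128, num_classes=527):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):        # tokens: (batch, seq_len) int64
        x = self.embedding(tokens)    # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)             # average-pool over the sequence
        return self.fc(x)             # multi-label logits

Mean-pooling discards all ordering, so the whole clip collapses to a single averaged embedding; this is a floor to fit, not a ceiling.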

Read more...

Audio Tokens Part 4: More Tokens!

audio-tokens

Well, more types of tokens, anyway. Instead of a limited vocabulary of 50 tokens, let’s try something a little more interesting. Like 1000. [time passes] Ran it; here’s an excerpt:

2024-09-05 18:16:23,258 - INFO - Epoch 9
2024-09-05 18:16:23,258 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0824
2024-09-05 18:16:23,258 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0826
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:40<00:00,  2.22it/s, loss=0.0169]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20<00:00,  7.22it/s]
2024-09-05 18:24:33,739 - INFO - Epoch 10
2024-09-05 18:24:33,739 - INFO - Train Loss: 0.0210, Train F1 (macro): 0.4990, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.0810
2024-09-05 18:24:33,739 - INFO - Val Loss: 0.0209, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.0893

That’s…terrible. Let’s try bumping the learning rate from 5e-5 to 1e-3 just to show that we’re serious.
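In PyTorch terms that’s a one-line change wherever the optimizer is constructed (Adam is an assumption here; the post doesn’t say which optimizer is in use):

import torch
import torch.nn as nn

model = nn.Linear(128, 527)  # stand-in for the real model
# Before: optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)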

Read more...

Audio Tokens Part 3: The First Run

audio-tokens

Time to try the first run at training the model. Let’s see what happens!

2024-09-05 13:59:48,289 - INFO - Epoch 1
2024-09-05 13:59:48,289 - INFO - Train Loss: 0.0429, Train F1 (macro): 0.4998, Train F1 (micro): 0.9946, Train Hamming Loss: 0.0054, Train mAP: 0.0106
2024-09-05 13:59:48,289 - INFO - Val Loss: 0.0210, Val F1 (macro): 0.4991, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1074
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00,  2.22it/s, loss=0.0196]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00,  7.16it/s]
2024-09-05 14:07:58,594 - INFO - Epoch 2
2024-09-05 14:07:58,594 - INFO - Train Loss: 0.0208, Train F1 (macro): 0.5210, Train F1 (micro): 0.9962, Train Hamming Loss: 0.0038, Train mAP: 0.1135
2024-09-05 14:07:58,594 - INFO - Val Loss: 0.0205, Val F1 (macro): 0.5629, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1144
Training: 100%|████████████████████████████████████| 1022/1022 [07:40<00:00,  2.22it/s, loss=0.0153]
Validating: 100%|█████████████████████████████████████████████████| 146/146 [00:20<00:00,  7.17it/s]
[...]
2024-09-05 15:21:42,568 - INFO - Epoch 11
2024-09-05 15:21:42,568 - INFO - Train Loss: 0.0177, Train F1 (macro): 0.5805, Train F1 (micro): 0.9963, Train Hamming Loss: 0.0037, Train mAP: 0.1851
2024-09-05 15:21:42,568 - INFO - Val Loss: 0.0188, Val F1 (macro): 0.5663, Val F1 (micro): 0.9963, Val Hamming Loss: 0.0037, Val mAP: 0.1548
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:41<00:00,  2.21it/s, loss=0.0215]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20<00:00,  7.15it/s]
[...]
2024-09-05 16:27:17,502 - INFO - Epoch 19
2024-09-05 16:27:17,502 - INFO - Train Loss: 0.0116, Train F1 (macro): 0.6765, Train F1 (micro): 0.9968, Train Hamming Loss: 0.0032, Train mAP: 0.4616
2024-09-05 16:27:17,502 - INFO - Val Loss: 0.0208, Val F1 (macro): 0.5698, Val F1 (micro): 0.9962, Val Hamming Loss: 0.0038, Val mAP: 0.1095
Training: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1022/1022 [07:41<00:00,  2.21it/s, loss=0.0116]
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:20<00:00,  7.13it/s]
2024-09-05 16:35:29,332 - INFO - Epoch 20
2024-09-05 16:35:29,332 - INFO - Train Loss: 0.0103, Train F1 (macro): 0.7092, Train F1 (micro): 0.9970, Train Hamming Loss: 0.0030, Train mAP: 0.5503
2024-09-05 16:35:29,332 - INFO - Val Loss: 0.0212, Val F1 (macro): 0.5895, Val F1 (micro): 0.9960, Val Hamming Loss: 0.0040, Val mAP: 0.1201

So, um. Reading the mAP values: validation starts out bad, improves to slightly less bad, then the overfitting kicks in, training mAP gets excellent, and validation mAP drops back to bad. Not great.

Read more...

Audio Tokens Part 2: The Architecture

audio-tokens

Here’s how things are set up now, in preparation for the first real training test:

(I’ve skipped over the “is this thing basically working” phase here, since it wasn’t that exciting. Take it as given that each component seems to be working as expected.)

Training Set

I’m using the AudioSet “bal_train” set, since it’s only around 20,000 files. AudioSet files are 10-second clips from YouTube videos.

Validation Set

I’m using a very sophisticated technique in data_splitter.py to pull a validation set out of the bal_train set. If the associated YouTube ID for an audio clip starts with one of the characters [ABCDEF], it’s a validation example. Assuming YouTube IDs are random-ish, this isn’t totally terrible.
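For the record, the rule is about one line (the function name and usage are mine; only the [ABCDEF] prefix test comes from data_splitter.py). Since YouTube IDs draw from a 64-character alphabet, six uppercase prefix characters should route roughly 6/64 ≈ 9% of clips to validation, assuming IDs are uniform:

def is_validation(youtube_id: str) -> bool:
    """Sketch of the data_splitter.py rule: clips whose YouTube ID
    starts with A-F (uppercase) go to validation."""
    return youtube_id.startswith(tuple("ABCDEF"))

# Hypothetical usage with made-up IDs:
clip_ids = ["Bq9rz34PqXk", "xQw4w9WgZcQ", "e5kzAbD3vLs"]
val_ids = [c for c in clip_ids if is_validation(c)]       # ["Bq9rz34PqXk"]
train_ids = [c for c in clip_ids if not is_validation(c)]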

Read more...

Audio Tokens Part 1: The Task

audio-tokens

[7 Oct 2024 update: There’s a lot here in this worklog, dead ends and all. If you’re looking for the tl;dr/wrap-up, may I suggest the final (for now) post?]

[23 Sep 2024 update: This is a work log largely for my personal use: “Remember, kids, the only difference between screwing around and science is writing it down”. There are mistakes and dead ends and obvious oversights and serious facepalm moments. My plan is to leave in all the screwups, not to try to retcon things later into an “I knew this from the beginning” sort of thing. As of today, things are still in progress, and there are ten of these entries. I expect there will be more.]

Read more...