============
== Stuff! ==
============
Lots of stuff, either projects I'm working on or old bookmarks I'm excavating

Kernel Shape in a CNN Audio Model

audio

(Code on GitHub.)

Audio has a strong temporal component. Unlike an image, audio is a thing that happens in time, not an arrangement of items in a space. And yet many audio classification models treat spectrograms as if they were still images and not events, an artifact of early successes applying visual models to audio datasets.

I took the ESC-50 dataset, created a simple five-layer CNN model, and trained it with various kernel shapes and sizes. My hypothesis: kernels that extend further in the temporal dimension will perform better.
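
For flavor, here's a minimal sketch of the kind of model involved (PyTorch, a spectrogram as input, kernel shape as a parameter). It's illustrative only; the actual project code differs in depth and details.

  # Minimal sketch: a small CNN where the convolution kernel shape is a
  # parameter, so a square kernel can be swapped for one stretched along
  # the time axis (or the frequency axis).
  import torch
  import torch.nn as nn

  class SpectrogramCNN(nn.Module):
      def __init__(self, n_classes=50, kernel_size=(3, 9)):
          # kernel_size is (freq, time): (3, 3) is the usual square,
          # (3, 9) leans temporal, (9, 3) leans spectral.
          super().__init__()
          padding = (kernel_size[0] // 2, kernel_size[1] // 2)

          def block(c_in, c_out):
              return nn.Sequential(
                  nn.Conv2d(c_in, c_out, kernel_size, padding=padding),
                  nn.BatchNorm2d(c_out),
                  nn.ReLU(),
                  nn.MaxPool2d(2),
              )

          self.features = nn.Sequential(block(1, 16), block(16, 32),
                                        block(32, 64), block(64, 128))
          self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(128, n_classes))

      def forward(self, x):
          # x: (batch, 1, n_freq_bins, n_time_frames)
          return self.head(self.features(x))

  # ESC-50 has 50 classes; compare e.g. (3, 3), (9, 3), and (3, 9).
  model = SpectrogramCNN(n_classes=50, kernel_size=(3, 9))
  logits = model(torch.randn(4, 1, 128, 256))  # -> (4, 50)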

Read more...

One Double RoBERTa, with a Side of Strange

attention-encodings

[The full demo tool, with the new model selection dropdown, is available for playing with here.]

[Previous posts here and here]

I added a model switcher to the attention-encoding demo: now you can toggle between roberta-base and roberta-large to compare how token representations evolve layer by layer.

The differences:

  • roberta-base: 12 layers, hidden size 768, 12 attention heads
  • roberta-large: 24 layers, hidden size 1024, 16 attention heads

So let’s try some of the previous roberta-base plots in roberta-large and see if large follows the same patterns.
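
(Pulling per-layer hidden states from both models is straightforward with Hugging Face transformers. The snippet below is a rough sketch of that step, not the demo's actual code.)

  # Grab the per-layer hidden states for the same sentence from both models.
  import torch
  from transformers import AutoTokenizer, AutoModel

  def layer_states(model_name, text):
      tok = AutoTokenizer.from_pretrained(model_name)
      model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
      model.eval()
      with torch.no_grad():
          out = model(**tok(text, return_tensors="pt"))
      # hidden_states: (embeddings, layer 1, ..., layer N),
      # each of shape (1, seq_len, hidden_size)
      return out.hidden_states

  text = "The quick brown fox jumps over the lazy dog."
  base = layer_states("roberta-base", text)    # 13 tensors, hidden size 768
  large = layer_states("roberta-large", text)  # 25 tensors, hidden size 1024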

Read more...

What's Going On at Layer 5?

attention-encodings

Project demo page

Previous post

Continuing on from the last post, you might have had the same question I did:

What the heck, layer 5?

Let’s go back to the relative ranking plot. This chart shows how many original vocabulary token embeddings are closer (by cosine distance) to a token’s current encoding at each layer than the token’s own original embedding.

So if the current token encoding at position 1 is closest to its own original embedding, it has a rank of 1. If 100 other vocab embeddings are closer, the rank is 100.
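
In rough code, computing that rank for a single position looks something like the sketch below. This is an illustration of the metric, not the demo's implementation; it assumes the model's input-embedding table and the "closest to its own embedding counts as rank 1" convention from above.

  import torch
  import torch.nn.functional as F

  def vocab_rank(current_encoding, token_id, embedding_table):
      # current_encoding: (hidden,)   embedding_table: (vocab_size, hidden)
      sims = F.cosine_similarity(current_encoding.unsqueeze(0),
                                 embedding_table, dim=-1)
      own_sim = sims[token_id]
      closer = (sims > own_sim).sum().item()  # vocab embeddings closer than the token's own
      return closer + 1                       # rank 1 = closest to its own embedding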

Read more...

More Attention to Attention

attention-encodings

[Note: the full demo tool is available for fiddling with here.]

A few months ago, I built a demo to visualize how token representations evolve inside the RoBERTa transformer model. Like a lot of people, I had the vague impression that transformers gradually contextualize each token layer by layer, refining its meaning as it flows through the network. But the more I looked, the less that held up. Depending on how I measured representational change (cosine similarity, vocabulary ranking, nearest neighbor overlap), I got completely different stories. This post is a short walkthrough of what I found, and how it challenged my assumptions about how and when transformers “make meaning.”
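
For concreteness, here's what those three measurements might look like for a single token position across one layer boundary. None of the names or defaults below come from the demo code; the point is only that the three numbers can move independently.

  import torch
  import torch.nn.functional as F

  def change_metrics(token_id, h_prev, h_curr, vocab_emb, k=10):
      # h_prev, h_curr: (hidden,) encodings of one token at consecutive layers
      # vocab_emb: (vocab_size, hidden) input-embedding table

      # 1. Cosine similarity between the two layer encodings.
      cos = F.cosine_similarity(h_prev, h_curr, dim=-1).item()

      # Similarity of each encoding to every vocabulary embedding.
      sims_prev = F.cosine_similarity(h_prev.unsqueeze(0), vocab_emb, dim=-1)
      sims_curr = F.cosine_similarity(h_curr.unsqueeze(0), vocab_emb, dim=-1)

      # 2. Vocabulary ranking: how many vocab embeddings outrank the token's own.
      rank = (sims_curr > sims_curr[token_id]).sum().item() + 1

      # 3. Nearest-neighbor overlap: fraction of shared top-k vocab neighbors.
      prev_nn = set(sims_prev.topk(k).indices.tolist())
      curr_nn = set(sims_curr.topk(k).indices.tolist())
      overlap = len(prev_nn & curr_nn) / k

      return cos, rank, overlap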

Read more...

Paying Attention Part 2

attention-encodings

A greatly-revised attention-encodings project is out and available here.

I realized that, in the previous incarnation of this project, I was comparing per-head attention outputs to value-encodings over the entire model vocabulary, which was a clever idea. The problem is that I forgot about the residual connection around the attention mechanism. So the project was essentially comparing the residual “deltas” against the original encodings, which might make some sense if you were just wondering how “far” an encoding moves at each layer.
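
The distinction in miniature (a toy sketch that ignores the output projection and LayerNorm): the attention sublayer's raw output is only a delta, and what the rest of the model actually sees is that delta added back onto the sublayer's input.

  import torch

  hidden = torch.randn(1, 8, 768)     # sublayer input: (batch, seq, hidden)
  attn_out = torch.randn(1, 8, 768)   # stand-in for the attention sublayer output

  delta_only = attn_out               # what the old version was effectively comparing
  post_residual = hidden + attn_out   # what actually flows on to the next layer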

Read more...

Paying Attention to Attention

attention-encodings

Fresh off the last experiment, I’ve decided to go back to an old project and dig a little deeper.

A while ago I was working on an experiment trying to figure out what attention layers are actually doing in transformers.

The traditional introduction to transformers goes something like this:

  • You turn your text into tokens
  • You create embeddings for each of those tokens
  • Some additions and projections later, that token encoding goes into an attention layer, at a particular spot in the sequence based on the order of the text. Let’s assume our token is at position 3.
  • The attention layer compares that token encoding to all the others in the sequence (query and keys, yes yes), and uses their relative similarities as weights to apply to the original embeddings in the sequence (projected by a value matrix). (This won’t make sense if you don’t already know how transformers work, for which I apologize; there’s a minimal sketch of this step just after the list.)
  • The output of that attention layer at position 3 is the original token, but with information from closely-related tokens mashed into it. So it’s a wider, more conceptually complex representation of the original token at position 3.
  • Do this 12 times with some standard fully-connected neural networks in-between, and you end up with a new sequence where each encoding is a representation of that original token, but with “meaning” infused based on the overall context of every other token in the sequence.
  • Use those fancy high-class output encodings to predict the next token, or classify your sequence, or whatever.
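
Here's that attention step in minimal single-head form (no multi-head split, masking, or dropout). It's a sketch of the mechanism, not any particular model's implementation.

  import torch
  import torch.nn.functional as F

  def single_head_attention(x, W_q, W_k, W_v):
      # x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head) projections
      Q, K, V = x @ W_q, x @ W_k, x @ W_v
      scores = Q @ K.T / K.shape[-1] ** 0.5   # query-key similarities
      weights = F.softmax(scores, dim=-1)     # one weight per position
      return weights @ V                      # weighted mix of value vectors

  d_model, d_head, seq_len = 768, 64, 10
  x = torch.randn(seq_len, d_model)
  W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
  out = single_head_attention(x, W_q, W_k, W_v)
  print(out[3].shape)  # the output at position 3: torch.Size([64])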

But that “output of the attention layer is another form of the original token” has always been interesting to me. What does an encoding like that look like? Can you do something else with it? Does it relate back to the original token in an interesting way?

Read more...

Audio Tokens Part 18: The Wrap-Up

audio-tokens

Updated 2025-04-22 with a brief intro for context.

This project explored whether short-time audio features (STFT slices) could be clustered into symbolic “tokens” and modeled using sequence architectures like BERT. It didn’t work out the way I’d hoped, but I figured out a lot about where this kind of approach breaks down, and what might be worth trying next. (Also, I spent a few days chasing phantom performance gains thanks to a classic extend() vs append() bug.)
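
The core idea, in sketch form (illustrative only, not the project's code): cluster short-time spectral frames into a discrete codebook, then treat each clip as a sequence of cluster ids, i.e. "audio tokens" a BERT-style model can consume.

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  stft_frames = np.abs(rng.standard_normal((5000, 257)))  # stand-in magnitude frames

  # Fit a codebook over frames pooled from the training set.
  codebook = KMeans(n_clusters=512, n_init=10, random_state=0).fit(stft_frames)

  # "Tokenize" one clip: each frame maps to its nearest cluster id.
  clip = np.abs(rng.standard_normal((400, 257)))
  token_ids = codebook.predict(clip)  # -> array of 400 audio-token ids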

Read more...

Audio Tokens Part 17: All Sane, So Far

audio-tokens

Here’s my checklist from the last post:

  • Look at a few more generated spectrograms.

    • Do they look sane? Continue.
    • Do they look insane? Fix the spectrograms!

    They look sane. Moving on.

  • Try the spectrograms with a standard vanilla CNN of the type that is known to work well on spectrograms.

    • Do the results improve significantly? End this round of this project and move on–it doesn’t work as-is.
    • Do the results not improve significantly? Keep going.

    They are roughly the same as all the other models. Peak val mAP I can get is 0.03-ish. Continuing.

Read more...

Audio Tokens Part 16: Sanity Checks for Everyone!

audio-tokens

I’ve been on a bit of a tear the past few days (right now at commit 3550615). I separated out the Dataset/DataLoader processing into its own classes, moved metrics calculation into its own class, and did a bit of cleanup refactoring, all so I could start sending the raw STFT vectors into models as embeddings and add a new dirt-simple baseline model.

All of this to try to figure out if the consistently terrible val mAP results that seem to happen on every variation of model and hyperparameters are just because this idea doesn’t work as-is, or if there might be another bug in the preprocessing pipeline mucking things up.
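
For reference, by "dirt-simple baseline" I mean something on the order of the sketch below (mean-pool the raw STFT frame vectors over time, one linear layer on top). The sizes are placeholders, not the project's real values.

  import torch
  import torch.nn as nn

  class MeanPoolBaseline(nn.Module):
      def __init__(self, n_freq_bins=257, n_classes=200):  # placeholder sizes
          super().__init__()
          self.classifier = nn.Linear(n_freq_bins, n_classes)

      def forward(self, stft_frames):
          # stft_frames: (batch, n_time_frames, n_freq_bins) magnitude vectors
          pooled = stft_frames.mean(dim=1)  # collapse the time axis
          return self.classifier(pooled)    # multi-label logits for a mAP metric

  model = MeanPoolBaseline()
  logits = model(torch.randn(8, 400, 257))  # -> (8, 200)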

Read more...

Trees and Language

animal-communication

From 2017: A biologist believes that trees speak a language we can learn

I haven’t yet followed up on the book written about here, but I do have the general sense that we’re going to be surprised by how many living things this ends up being true of.

This is a large part of why machine learning is interesting to me. If it’s about nothing else, machine learning is all about pattern recognition–finding patterns that are too complex for us to find on our own, or in a latent space that we aren’t really able to comprehend. Like non-human communication.

Read more...