3blue1brown

patreon


Next transformer chapter (final version)

Here's the final version of Chapter 7 of the neural net series, about the multilayer perceptron blocks in transformers, as motivated by the question of how LLMs may store facts. The plan is to make it public tomorrow; let me know if you catch any little errors in the meantime.


Comments

Ahhh, so satisfying! Thank you Grant! There is one thing that I find confusing: why would the interpretation of the embedding space (i.e. the semantic interpretation of its basis of dimension vectors) remain the same throughout? I get that the first and last layers need to relate vectors to concepts identically, as we use the same embedding coder and decoder in both, but couldn't it be possible that in all the 94 layers in between the embedding space takes on totally different meanings? I wonder if there's a simple answer that I'm missing or if it's more of a complex finding from other research. Not feedback for this chapter (it's perfect as it is), but if I'm not the only one wondering this, it may be something to mention in future ones.

Steven Siddals

Just wow!

Holger Flier

Hi Grant, are you going to cover the process by which the list of predictions of the next word and their probabilities are extracted as a function of the final token?

Meghan

The Johnson–Lindenstrauss lemma seems to be an extraordinary result - can you please provide some intuition as to why it’s true, especially because one can’t really “see” this in 2 or 3 dimensions?

Meghan

It's interesting that 12288 is 12 x 1024 which probably means the calculations can be more easily broken down into 1k blocks for running on several GPUs/GPU cores. It also suggests why they picked 4x of that to determine the size of the inner matrix. I wonder whether they did any experiments to determine whether 13 x or 11 x would have been very different in resulting power compared to their chosen value. (Also, it might be that they factored it 3 x 4096 or 6 x 2048 instead of 12.) Did they publish any papers giving a hint?

William Smith

The plot is showing all angles between *all* pairs of vectors. So in this example, every possible pair is between 89 and 91 degrees apart.

3blue1brown

At 17:13, the speech bubbles overlap slightly which makes it hard to read the text. Not sure if that’s intentional or a mistake

Kyra

You list Batch Normalization at 21:45, but that's not used in modern transformers, so it might be worth cutting.

Neel Nanda

Another highly subtle mathematical insight: around 17:09, it should probably say "stores" instead of "store". (:

wye

There's a typo in the word "Michael" around 5:18. (truly I caught a very fundamental mathematical error there)

wye

So, in two days, I will start a PhD in David Bau's lab: https://baulab.info/ (well-known for the ROME paper: https://rome.baulab.info/). I don't know what the joint probability is that you posted a video on exactly the topic I'm about to start a PhD in less than a week before I start, when you were the one that sparked my initial interest many years ago in math, but I'm guessing it is vanishingly small - feels very serendipitous. Thanks for the excellent series, as always! EDIT: thoughts - maybe it would be good to mention that this is all the transpose of how it's actually implemented in practice. So in actual code, tokens are encoded as row vectors, and an operation on a row vector would look like x^T W, where W is the weight matrix doing the linear transformation. Also, at 21:31, you may consider changing "Sparse Autoencoder" in the speech bubble to "Sparse Autoencoders".

Alex Loftus
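
The transpose point in the comment above can be illustrated with a small sketch. This is not code from the video or from any particular library; the helper names (`matvec`, `vecmat`, `transpose`) and the toy numbers are made up for illustration, and plain Python lists are used so it runs without dependencies. It only checks the identity (W x)^T = x^T W^T: multiplying a column vector by W on the left gives the same numbers as multiplying the corresponding row vector by the transposed matrix on the right.

```python
def matvec(W, x):
    """Column convention: treat x as a column vector and compute W x."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def vecmat(x, M):
    """Row convention: treat x as a row vector and compute x M."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]


def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


# Hypothetical toy weights mapping a 3-dimensional vector to 2 dimensions.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]

column_result = matvec(W, x)             # W x, the column-vector form
row_result = vecmat(x, transpose(W))     # x^T W^T, the row-vector form

print(column_result, row_result)
```

Under the row-vector convention, the matrix that gets stored would be the transpose of the one in the column-vector picture, which is why the two forms describe the same linear map.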

Yep, that *is* a surprising result about higher dimensional spaces on a par with how spiny high dimensional hypercubes get. I'm curious if your code is measuring simply "angle between individual pairs of vectors" or if it's getting "*minimum* angle between a single vector and *every* other vector"? Because one gets the impression that for a feature to be distinct, saying that it's nearly perpendicular to "lots" of other features the training might encode might not be as helpful as saying that it's nearly perpendicular to *every* other feature encoded.

Jesse Thompson
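
The distinction raised in this question can be made concrete with a rough sketch. It is an assumption-laden illustration, not the code behind the video: it uses random Gaussian vectors rather than whatever procedure produced the plot, and the dimension, vector count, and function names are arbitrary choices. It computes the angle for every pair, then for each vector the worst deviation from 90 degrees against every other vector. If all pairs are measured, a bound on the pairwise angles is automatically a bound on each vector's worst case.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cos))


def pairwise_angles(vectors):
    """Angles for every unordered pair (i, j) with i < j."""
    n = len(vectors)
    return {(i, j): angle_degrees(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}


def worst_deviation_per_vector(angles, n):
    """For each vector, the largest |angle - 90| against any other vector."""
    worst = [0.0] * n
    for (i, j), a in angles.items():
        dev = abs(a - 90.0)
        worst[i] = max(worst[i], dev)
        worst[j] = max(worst[j], dev)
    return worst


# Assumed sizes, chosen only so the pure-Python loop stays fast.
DIM, COUNT = 1000, 40
rng = random.Random(0)  # fixed seed so reruns give the same vectors
vectors = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(COUNT)]

angles = pairwise_angles(vectors)
per_vector_worst = worst_deviation_per_vector(angles, COUNT)

print("pairs measured:", len(angles))
print("angle range over all pairs:", min(angles.values()), max(angles.values()))
print("largest per-vector deviation from 90:", max(per_vector_worst))
```

The largest per-vector deviation always equals the largest deviation over all pairs, so a histogram covering *all* pairs already answers the "every other feature" version of the question.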

Great exposition, as always

Daniel Armesto

