3blue1brown

Next transformer chapter (final version)

Added 2024-08-30 17:35:22 +0000 UTC

Here's the final version of Chapter 7 of the neural net series, about the multilayer perceptron blocks in transformers, as motivated by the question of how LLMs may store facts. The plan is to make it public tomorrow; let me know if you catch any little errors in the meantime.

Comments

Ahhh, so satisfying! Thank you Grant! There is one thing that I find confusing: why would the interpretation of the embedding space (i.e. the semantic interpretation of its basis of dimension vectors) remain the same throughout? I get that the first and last layers need to relate vectors to concepts identically, as we use the same embedding coder and decoder in both, but couldn't it be possible that in all the 94 layers in between the embedding space takes on totally different meanings? I wonder if there's a simple answer that I'm missing or if it's more of a complex finding from other research. Not feedback for this chapter (it's perfect as it is), but if I'm not the only one wondering this, it may be something to mention in future ones.

Steven Siddals

2024-09-01 16:27:53 +0000 UTC

Just wow!

Holger Flier

2024-09-01 14:12:04 +0000 UTC

Hi Grant, are you going to cover the process by which the list of predictions of the next word and their probabilities are extracted as a function of the final token?

Meghan

2024-09-01 10:59:55 +0000 UTC
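For readers with the same question, here is a minimal sketch of the usual last step: the final token's vector is multiplied by an unembedding matrix to get one score per vocabulary entry, and a softmax turns those scores into probabilities. The name `W_U`, the toy sizes, and the random values are assumptions made for illustration; none of them come from the chapter.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5               # toy sizes, far smaller than a real model

W_U = rng.standard_normal((vocab_size, d_model))   # hypothetical unembedding matrix
final_vector = rng.standard_normal(d_model)        # stand-in for the last token's vector

logits = W_U @ final_vector              # one score per vocabulary entry
probs = softmax(logits)                  # probability of each candidate next token
top = np.argsort(probs)[::-1]            # token indices, most likely first

for token_id in top:
    print(token_id, round(float(probs[token_id]), 3))
```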

The Johnson–Lindenstrauss lemma seems to be an extraordinary result - can you please provide some intuition as to why it’s true, especially because one can’t really “see” this in 2 or 3 dimensions?

Meghan

2024-08-31 19:53:58 +0000 UTC

It's interesting that 12288 is 12 x 1024 which probably means the calculations can be more easily broken down into 1k blocks for running on several GPUs/GPU cores. It also suggests why they picked 4x of that to determine the size of the inner matrix. I wonder whether they did any experiments to determine whether 13 x or 11 x would have been very different in resulting power compared to their chosen value. (Also, it might be that they factored it 3 x 4096 or 6 x 2048 instead of 12.) Did they publish any papers giving a hint?

William Smith

2024-08-31 16:45:59 +0000 UTC
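The arithmetic in this comment can be checked directly. The sketch below also includes the head count and per-head size from the published GPT-3 configuration (96 heads of size 128), which are not mentioned in this thread and are offered only as one possible reading of where 12288 comes from.

```python
d_model = 12288

# The factorizations mentioned in the comment all agree.
assert d_model == 12 * 1024 == 3 * 4096 == 6 * 2048

# Assumption from the published GPT-3 configuration, not from this thread:
# 96 attention heads, each of size 128.
n_heads, d_head = 96, 128
assert n_heads * d_head == d_model

# The inner (hidden) size of the MLP block is 4x the embedding size.
mlp_hidden = 4 * d_model
print(mlp_hidden)  # 49152
```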

The plot is showing all angles between all pairs of vectors. So in this example, every possible pair is between 89 and 91 degrees apart.

3blue1brown

2024-08-31 04:18:03 +0000 UTC
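A rough way to reproduce the flavor of that plot is to measure the angle between every distinct pair of vectors in a set. The sketch below is not the code used in the video: it draws purely random Gaussian vectors with no optimization step, and the sizes (`dim`, `count`) are assumed values picked so it runs quickly.

```python
import numpy as np

def pairwise_angles_deg(vectors):
    """Return the angle in degrees between every distinct pair of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosines = np.clip(unit @ unit.T, -1.0, 1.0)   # guard against rounding past +/-1
    i, j = np.triu_indices(len(vectors), k=1)      # each unordered pair exactly once
    return np.degrees(np.arccos(cosines[i, j]))

rng = np.random.default_rng(0)
dim, count = 10_000, 200                           # assumed toy sizes
vectors = rng.standard_normal((count, dim))

angles = pairwise_angles_deg(vectors)
print(len(angles), angles.min(), angles.max())
```

Because the minimum and maximum are taken over all pairs, they also bound the angle from any single vector to its nearest neighbor in the set.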

At 17:13, the speech bubbles overlap slightly which makes it hard to read the text. Not sure if that’s intentional or a mistake

Kyra

2024-08-31 03:15:37 +0000 UTC

You list Batch Normalization at 21:45, but that's not used in modern transformers, might be worth cutting

Neel Nanda

2024-08-31 01:29:38 +0000 UTC

Another highly subtle mathematical insight: around 17:09, it should probably say "stores" instead of "store". (:

wye

2024-08-30 19:51:50 +0000 UTC

There's a typo in the word "Michael" around 5:18. (truly I caught a very fundamental mathematical error there)

wye

2024-08-30 19:38:07 +0000 UTC

So, in two days, I will start a PhD in David Bau's lab: https://baulab.info/ (well-known for the ROME paper: https://rome.baulab.info/). I don't know what the joint probability is that you posted a video on exactly the topic I'm about to start a PhD in, less than a week before I start, when you were the one that sparked my initial interest in math many years ago, but I'm guessing it is vanishingly small. It feels very serendipitous. Thanks for the excellent series, as always! EDIT: thoughts - maybe it would be good to mention that this is all the transpose of how it's actually implemented in practice. So in actual code, tokens are encoded as row vectors, and an operation on a row vector would look like x^T W, where W is the weight matrix doing the linear transformation. Also, at 21:31, you may consider changing "Sparse Autoencoder" in the speech bubble to "Sparse Autoencoders".

Alex Loftus

2024-08-30 19:08:03 +0000 UTC
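The row-versus-column point can be illustrated with a small sketch. The sizes and the names `W`, `W_row`, and `X` are made up for illustration; the only claim is that both conventions give the same numbers once the weight matrix is transposed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3                         # toy sizes, chosen only for illustration

W = rng.standard_normal((d_out, d_in))     # column convention: y = W x
x = rng.standard_normal(d_in)

y_column = W @ x                           # vector treated as a column

W_row = W.T                                # same weights stored transposed: (d_in, d_out)
y_row = x @ W_row                          # vector treated as a row: y = x^T W_row

print(np.allclose(y_column, y_row))

# With tokens stacked as rows, one matrix product transforms the whole sequence.
X = rng.standard_normal((5, d_in))         # 5 hypothetical tokens, one per row
Y = X @ W_row                              # shape (5, d_out)
```

The row form is convenient in code because a sequence of tokens is naturally stored as a matrix with one token per row.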

Yep, that is a surprising result about higher dimensional spaces, on a par with how spiny high dimensional hypercubes get. I'm curious whether your code is measuring simply "angle between individual pairs of vectors" or whether it's getting "minimum angle between a single vector and every other vector". One gets the impression that for a feature to be distinct, saying that it's nearly perpendicular to "lots" of other features the training might encode might not be as helpful as saying that it's nearly perpendicular to every other feature encoded.

Jesse Thompson

2024-08-30 18:35:22 +0000 UTC

Great exposition, as always

Daniel Armesto

2024-08-30 18:27:16 +0000 UTC
