Entry tags:
Let's understand Large Language Models better
This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
no subject
Around 1:30:25 to 1:31:05 in the video:

"we think you're grown up enough that you can figure out where it's useful to look, and we're going to give you some fraction of your parameters (I think something like one-sixth of the parameters of the Transformer go to attention) and we're like: use these parameters to figure out where you should be moving information from. What does an intelligent [querying] and an intelligent convolution look like? As we'll see later with induction heads, there can actually be a pretty sophisticated and intelligent amount of computation that goes into what this smart dynamic convolution looks like. But yeah, fundamentally attention is a generalized convolution where we allow Transformers to compute how they ought to be moving information around for themselves."
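A minimal sketch of the "generalized convolution" framing (my own illustration, not from the video; the shapes, weights, and function names are made up): a convolution mixes positions with fixed weights, while an attention head computes its mixing weights from the data through the query/key parameters. A back-of-the-envelope comment on the "one-sixth" figure is included, under the assumption of standard GPT-2-style layer shapes.

import numpy as np

def fixed_convolution(x, kernel):
    # A (causal) convolution: the same fixed weights mix positions everywhere.
    # x: (seq_len, d_model); kernel: (window,) fixed weights over past positions.
    seq_len, _ = x.shape
    out = np.zeros_like(x)
    for t in range(seq_len):
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                out[t] += weight * x[t - k]
    return out

def attention_head(x, W_Q, W_K, W_V):
    # A single attention head: the mixing weights are computed from the data
    # (queries and keys), so each position decides for itself where to move
    # information from -- a "dynamic convolution".
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)   # causal: no looking ahead
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Back-of-the-envelope (GPT-2-style block): W_Q, W_K, W_V, W_O give ~4*d_model^2
# parameters and the MLP ~8*d_model^2 per layer; the query/key matrices that
# decide *where* to move information from are ~2/12, i.e. one way to read the
# "one-sixth" in the quote.

# Toy shapes, purely to make the sketch run.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 4, 8
x = rng.normal(size=(seq_len, d_model))
print(fixed_convolution(x, kernel=np.array([0.5, 0.3, 0.2])).shape)
print(attention_head(x,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head))).shape)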
no subject
generally, I would not assume their explanations are complete, even for these small models
it makes more sense to treat their approach as a viewpoint, not as "The Truth"
(especially after listening to his caveats near 1:57:00)
no subject
And another frequent motif is that these heads are good at fixing the weirdness of tokenizers
Around 2:01:00: for more complicated models, it is useful to think of attention heads as doing a lot of skip trigrams, with other things layered on top of that
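As a toy illustration of the skip-trigram picture (my own sketch; the table entries are made-up examples, not taken from the paper): a one-layer attention head behaves roughly like a lookup table of patterns [A] ... [B] -> [C], where seeing A somewhere earlier and B at the current position boosts the logit of C. The second entry is a hypothetical tokenizer-flavored case, where a word split across tokens gets stitched back together.

# Hypothetical skip-trigram table, for illustration only.
SKIP_TRIGRAMS = {
    # (earlier token A, current token B): token C whose logit gets boosted
    ("keep", "in"): "mind",
    ("Ralph", "R"): "alph",   # made-up tokenizer-repair flavour: "Ralph" ... "R" -> "alph"
}

def skip_trigram_boosts(tokens):
    # Return the tokens a skip-trigram head would boost at the last position.
    current = tokens[-1]
    boosts = []
    for earlier in tokens[:-1]:
        c = SKIP_TRIGRAMS.get((earlier, current))
        if c is not None:
            boosts.append(c)
    return boosts

print(skip_trigram_boosts(["keep", "this", "fact", "in"]))   # ['mind']
print(skip_trigram_boosts(["so", "Ralph", "said", "R"]))     # ['alph']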
no subject