Let's understand Large Language Models better
This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
no subject
1) Interestingly, what they seem to say is that splitting into attention heads is not just an efficiency device, but is semantically meaningful (it would be interesting to experiment with very small dimensions for attention heads, perhaps even as small as 1 or 2; a minimal sketch of such a setup follows below).
2) Interestingly, Neel Nanda thinks that using the tensor product formalism is a methodological mistake (it certainly makes the material more difficult to understand, but perhaps it enables more powerful ways of thinking; in any case, this use of tensor products is optional, see the notation sketch below).
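A minimal sketch of the experiment suggested in 1), assuming standard PyTorch multi-head attention (where the per-head dimension is embed_dim // num_heads); the sizes below are arbitrary, chosen only so that the head dimension comes out as 1 or 2:

import torch
import torch.nn as nn

embed_dim = 64
for num_heads in (64, 32):  # head dimensions 1 and 2, respectively
    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
    x = torch.randn(1, 10, embed_dim)        # (batch, sequence, d_model)
    out, pattern = attn(x, x, x)             # self-attention
    print(num_heads, embed_dim // num_heads, out.shape, pattern.shape)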
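And for reference on 2): as I read the paper, the tensor product notation simply packages the two independent linear maps inside a head (the attention pattern A, which mixes across token positions, and W_O W_V, which acts within each position's residual-stream vector) into a single operator:

h(x) = (\mathrm{Id} \otimes W_O) \cdot (A \otimes \mathrm{Id}) \cdot (\mathrm{Id} \otimes W_V) \cdot x = (A \otimes W_O W_V) \cdot x

So the same head can equivalently be described without tensor products: apply W_O W_V to every position's vector, then mix the results across positions according to A.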
no subject
(Quoting Neel Nanda's video walkthrough, starting around 1:30:25:)
"We think you're grown up enough that you can figure out where it's useful to look, and we're going to give you some fraction of your parameters (I think something like one-sixth of the parameters of the Transformer go to attention), and we're like: use these parameters to figure out where you should be moving information from. What does an intelligent, dynamic convolution look like? As we'll see later with induction heads, there can actually be a pretty sophisticated and intelligent amount of computation that goes into what this smart dynamic convolution looks like. But, fundamentally, attention is a generalized convolution where we allow Transformers to compute how they ought to be moving information around for themselves."
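To make the "generalized convolution" framing concrete, here is a toy NumPy sketch (my own illustration, not from the video): a fixed convolution mixes token positions with the same data-independent weights everywhere, whereas attention computes its mixing weights from the input itself via queries and keys.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
x = rng.normal(size=(seq_len, d_model))

# Fixed convolution: every position mixes its neighbours with the same 3-tap kernel.
kernel = {-1: 0.25, 0: 0.5, 1: 0.25}
mix_fixed = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for offset, w in kernel.items():
        if 0 <= i + offset < seq_len:
            mix_fixed[i, i + offset] = w
y_conv = mix_fixed @ x

# Attention: the mixing weights are themselves computed from the data, so the
# "kernel" differs per input and per position.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_model)
scores -= scores.max(axis=1, keepdims=True)                      # stable softmax
mix_attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y_attn = mix_attn @ x

print(y_conv.shape, y_attn.shape)   # both (8, 16): both just move information between positions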
no subject
generally, I would not assume their explanations are complete, even for these small models
it makes better sense to think about their approach as a viewpoint, and not as "The Truth"
(especially after listening to his caveats near 1:57:00)
no subject
And another frequent motif is that these things are good at fixing the weirdness of tokenizers.
Around 2:01:00: for more complicated models, it is useful to think of attention heads as doing a lot of skip-trigrams, and doing other things on top of that (a toy illustration of skip-trigrams follows below).
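For context, a skip-trigram in the paper's sense is a pattern "[source token somewhere earlier] ... [current token] -> [predicted next token]". A toy illustration of a pure skip-trigram mechanism as a lookup table (the table entries and the code are my own illustrative sketch, not how the model actually computes this):

# (earlier source token, current token) -> next token whose logit gets boosted
skip_trigrams = {
    ("keep", "in"): "mind",
    ("keep", "at"): "bay",
    ("Paris", "visit"): "Paris",   # "copying" flavour: repeat a token seen earlier
}

def boosted_next_tokens(context):
    current = context[-1]
    return {skip_trigrams[(src, current)]
            for src in context[:-1] if (src, current) in skip_trigrams}

print(boosted_next_tokens(["please", "keep", "this", "in"]))   # {'mind'}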
no subject
ATTENTION: The correction (bug fix) in the calculation of compositions is a relatively recent addition: according to the Wayback Machine, this correction was added between May 21 and May 24, 2023.
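For reference, the "compositions" here are the Q-, K-, and V-composition scores between heads in different layers. As I read the paper, a composition score is the Frobenius norm of the product of the two heads' relevant weight matrices, normalized by the product of their individual Frobenius norms; a small sketch with stand-in random matrices instead of actual head weights:

import numpy as np

def composition_score(w_later, w_earlier):
    # ||W_later @ W_earlier||_F / (||W_later||_F * ||W_earlier||_F)
    return (np.linalg.norm(w_later @ w_earlier)
            / (np.linalg.norm(w_later) * np.linalg.norm(w_earlier)))

d_model = 32
rng = np.random.default_rng(0)
w_ov_layer0 = rng.normal(size=(d_model, d_model))   # stand-in for an early head's W_O W_V
w_qk_layer1 = rng.normal(size=(d_model, d_model))   # stand-in for a later head's W_Q^T W_K
print(composition_score(w_qk_layer1, w_ov_layer0))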
no subject
An interactive interface: https://transformer-circuits.pub/2021/framework/2L_HP_normal.html
no subject
But one should really study the next paper, "In-context Learning and Induction Heads": https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html