dmm | Let's understand Large Language Models better

This is a good starting point:

"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html

Flat | Top-Level Comments Only

From:

dmm

Video lecture by Neel Nanda (Oct 2022): https://www.youtube.com/watch?v=KV5gbOmHbjU

via https://twitter.com/NeelNanda5/status/1580782930304978944

From:

dmm

"important implications of this residual stream is the sum of paths
29:36
um or model functionality as a sum of paths via the residual stream notion"

From:

dmm

"um by projecting it back to the residual stream um one thing to emphasize is I don't
40:32
actually like the words reading and writing here because I think they can be pretty misleading but in particular
40:38
reading and writing intuitively feel like inverses or complementary operations but they're actually very
40:44
different so I prefer the word um project for read and embed for write"

From:

dmm

"um side note basically every Transformer I've come across they just hard code the number of
45:34
MLP neurons as four times the residual stream width I don't know why but
45:39
everyone does it so you just memorize the number four"

From:

dmm

"In many cases, we've found it helpful to reframe transformers in equivalent, but non-standard ways. Mechanistic interpretability requires us to break models down into human-interpretable pieces. An important first step is finding the representation which makes it easiest to reason about the model. In modern deep learning, there is — for good reason! — a lot of emphasis on computational efficiency, and our mathematical descriptions of models often mirror decisions in how one would write efficient code to run the model. But when there are many equivalent ways to represent the same computation, it is likely that the most human-interpretable representation and the most computationally efficient representation will be different."

From:

dmm

"An especially useful consequence of the residual stream being linear is that one can think of implicit "virtual weights" directly connecting any pair of layers (even those separated by many other layers), by multiplying out their interactions through the residual stream. These virtual weights are the product of the output weights of one layer with the input weights of another (ie. W_{I}^2W_{O}^1), and describe the extent to which a later layer reads in the information written by a previous layer."

From:

dmm

"Perhaps because of this high demand on residual stream bandwidth, we've seen hints that some MLP neurons and attention heads may perform a kind of "memory management" role, clearing residual stream dimensions set by other layers by reading in information and writing out the negative version."

footnote: "Some MLP neurons have very negative cosine similarity between their input and output weights, which may indicate deleting information from the residual stream. Similarly, some attention heads have large negative eigenvalues in their W_OW_V matrix and primarily attend to the present token, potentially serving as a mechanism to delete information. It's worth noticing that while these may be generic mechanisms for "memory management" deletion of information, they may also be mechanisms for conditionally deleting information, operating only in some cases."

From:

dmm

Revisiting Neel Nanda lecture (and the paper itself).

~10:00 (anecdotally) an attempt to interpret small visual models on MNIST by Chris Olah did not work, but visual models became more interpretable when they got larger.

In Transformers, smaller models are easier to understand, but this is by no means obvious (says Neel Nanda in that lecture, but who knows how this would change eventually; in any case, the knowledge thus acquired does seem to be transferrable OK to larger models).

Edited Date: 2023-10-29 04:26 pm (UTC)

From:

dmm

~24:50 Ah, this is why we predict not only the next token, but all tokens after all preceding partial contexts.

This is useless at inference, but this works great to parallelize training.

From:

dmm

~29:00 it seems that most computations only go through a couple of layers (residual stream gives it the freedom to do this).

(So the bulk of computations are probably shallow, with a bit of "true deepness" sprinkled on top of it.)

From:

dmm

~32:00 residual stream is really messy, so interpretation via meaningful paths through the model is the only viable way (thankfully, it works OK).

(perhaps people who conjecture about "holographic storage" within residual stream are right, who knows; one can consider improving it in various ways: a) towards detangling, b) alternatively, towards better holography)

Edited Date: 2023-10-29 06:00 pm (UTC)

From:

dmm

~36:00 privileged bases in vector spaces come from non-linearities (which is why the residual stream tends not to have any privileged basis)

(but, actually, positions are meaningful, so there is still a bit of privileged structure in the residual stream, just (perhaps) not within the embedding vectors (but perhaps even there, if we look closely, who knows))

~37:50 spectrum of how privileged a basis is, rather than a binary privileged vs non-privileged

(the truth is there are traces of various privileges in the residual stream as well)

~39:30 even ADAM privileges everything it interacts with, because of its weirdness ("ADAM sucks" says Neel Nanda, but I don't think it's necessarily so, perhaps this artificial thing is good, who knows(!)).

Edited Date: 2023-10-29 06:26 pm (UTC)

From:

dmm

~43:00 virtual weights - it's not really between layers, it's between attention heads (just see how (if at all) to take into account a particular attention head output by a particular attention head input)

that gives some crude proxy for what's going on

Edited Date: 2023-10-29 06:45 pm (UTC)

From:

dmm

~45:00 there is this weird ambiguity when people only use embedding dimension when they talk about dimensionality of residual stream and omit the context length dimension; I can imagine this might bite us in many ways (not only in confusion, but in leading to wrong conclusions)

From:

dmm

I just double-checked his remark about MLP having 4 times more neurons than embedding in https://github.com/karpathy/minGPT/blob/master/mingpt/model.py and yes, it is the case there

(but we need to see how this works with context length, it's not very transparent in the code, which is inconvenient; in MLP it is even less transparent than in the attention layer, where they have to write it explicitly in connection with splitting into attention heads)

From:

dmm

53:50 Transformers figure out how to clean-up unnecessary leftovers from residual stream

From:

dmm

56:00 he speculates that embedding and unembedding only using a fraction of residual stream dimensionality

From:

dmm

Moving towards understanding of attention heads...

1) Interestingly, what they seem to say is that splitting into attention heads is not just an efficiency device, but is semantically meaningful (it would be interesting to experiment with very small dimensions for attention heads, perhaps even as small as 1 (and also 2, etc)).

2) Interestingly, Neel Nanda thinks that using the tensor product formalism is a methodological mistake (it certainly does make the material more difficult to understand, but perhaps this might enable more powerful ways of thinking; anyway, this use of tensor products is, at least, optional).

From:

dmm

~ 1:17:30 not only does attention move info from residual stream of one token to another, then information accumulated from multiple residual streams of many tokens can be moved again (combining aggregation and compositionality)

Edited Date: 2023-10-30 05:05 am (UTC)

From:

dmm

"1:30:18
we think you're grown up enough that you can figure out where it's useful to look and we're going to give you some
1:30:25
fraction of your premises I think it's like articles to attention something like uh
1:30:32
one-sixth of the parameters of the Transformer go to attention and we're like
1:30:37
these parameters to figure out where you should be moving information from what
1:30:42
does an intelligent worrying and an intelligent convolution look like and as we'll see later with induction
1:30:48
heads there can actually be like a pretty sophisticated and intelligent amount of computation
1:30:53
that goes into what this smart Dynamic convolution
1:30:58
looks like but yeah fundamentally attention is a generalized convolution where we allow
1:31:05
Transformers to compute how they ought to be moving information around for themselves"

From:

dmm

around 1:54:00 his explanation why it is only trigrams and no more complicated interactions for one-layer situation does not look all that convincing

generally, I would not assume their explanations are complete, even for these small models

it makes better sense to think about their approach as a viewpoint, and not as "The Truth"

(and especially listening to his caveats near 1:57:00)

Edited Date: 2023-10-30 03:47 pm (UTC)

From:

dmm

Lots of copying though; it's a frequent motif

And another frequent motif is that these things are good with fixing the weirdness of tokenizers

2:01:00 and for more complicated models, it is useful to think that attention heads are doing a lot of skip trigrams and doing other things on top of that

Edited Date: 2023-10-30 04:02 pm (UTC)

From:

dmm

2:08:00 In addition to what they are saying about positive eigenvalues being much weaker than e.g. Adam Nemecek's paper is hoping for, here Neel Nanda is saying that even this does not really generalize to larger models

From:

dmm

MOVING TOWARDS TWO-LAYER MODELS

ATTENTION: The correction (bug fix) in calculation of compositions is a relatively recent addition: according to the Wayback Machine this correction has been added between May 21 and May 24, 2023

From:

dmm

IN PARTICULAR, that bug fix happened many months after Neel Nanda recorded his lecture, so expect this error to persist in his presentation.

An interactive interface: https://transformer-circuits.pub/2021/framework/2L_HP_normal.html

From:

dmm

They say that compositions often copy sequences (exactly or approximately).

But one should really study the next paper: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

From:

dmm

That paper also has this phrase:

"Ultimately, our goal in this initial paper is simply to establish a foothold for future efforts on this problem. Much future work remains to be done.'

(The whole field is probably larger than what one person can even overview at this point, the trick is to navigate through this material in a fruitful way).