dmm | Let's understand Large Language Models better

This is a good starting point:

"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html

~29:00 It seems that most computations pass through only a couple of layers (the residual stream gives the model the freedom to do this).

(So the bulk of computations are probably shallow, with a bit of "true deepness" sprinkled on top of it.)
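To make the "paths through a few layers" picture concrete, here is a minimal sketch under a strong assumption: two purely linear residual layers (real transformer layers are not linear). The matrices `A1`, `A2` and the input `x` are made up for illustration. The point is only that a residual forward pass can be rewritten as a sum over paths of depth 0, 1 and 2.

```python
# Toy residual network with two linear layers (illustrative assumption only).

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(*vectors):
    """Elementwise sum of any number of equal-length vectors."""
    return [sum(parts) for parts in zip(*vectors)]

A1 = [[0.0, 1.0], [0.5, 0.0]]    # hypothetical layer-1 map
A2 = [[0.2, 0.0], [0.0, -0.3]]   # hypothetical layer-2 map
x = [1.0, 2.0]                   # hypothetical input to the residual stream

# Ordinary forward pass: each layer adds its output to the residual stream.
h1 = add(x, matvec(A1, x))
out = add(h1, matvec(A2, h1))

# The same output, written as a sum over individual paths.
path_direct = x                           # depth 0: skips both layers
path_l1 = matvec(A1, x)                   # depth 1: layer 1 only
path_l2 = matvec(A2, x)                   # depth 1: layer 2 only
path_l1_l2 = matvec(A2, matvec(A1, x))    # depth 2: layer 1 then layer 2
out_from_paths = add(path_direct, path_l1, path_l2, path_l1_l2)
```

Only one of the four paths here is "truly deep"; with more layers the number of paths grows quickly, but nothing forces the important ones to be long.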

~32:00 The residual stream is really messy, so interpreting the model via meaningful paths through it is the only viable approach (thankfully, it works OK).

(Perhaps the people who conjecture about "holographic storage" within the residual stream are right, who knows. One can consider improving it in various ways: a) towards disentangling, or b) alternatively, towards better holography.)

Edited 2023-10-29 18:00 (UTC)

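~36:00 Privileged bases in vector spaces come from non-linearities, since an elementwise non-linearity singles out the coordinate axes it acts on (which is why the residual stream, which is only read from and written to linearly, tends not to have any privileged basis).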

(But, actually, token positions are meaningful, so there is still a bit of privileged structure in the residual stream, just perhaps not within the embedding vectors themselves; and perhaps even there, if we look closely, who knows.)
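A small sketch of why non-linearities create a privileged basis. The assumptions are mine: a 2D rotation stands in for an arbitrary change of basis, and ReLU stands in for whatever elementwise non-linearity the model uses. Rotating a purely linear representation and rotating back changes nothing, but squeezing a ReLU in between the two rotations gives a different answer than applying ReLU in the original basis.

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector by angle theta (a stand-in for a change of basis)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def relu(v):
    """Elementwise ReLU: acts on each coordinate axis separately."""
    return [max(0.0, a) for a in v]

theta = math.pi / 4
v = [1.0, -1.0]

# Linear case: change basis and change back; nothing is lost.
linear_round_trip = rotate(rotate(v, theta), -theta)

# Non-linear case: ReLU in the original basis vs. ReLU in the rotated basis.
relu_original_basis = relu(v)
relu_rotated_basis = rotate(relu(rotate(v, theta)), -theta)

mismatch = max(abs(a - b) for a, b in zip(relu_original_basis, relu_rotated_basis))
```

So a layer with an elementwise non-linearity "cares" which basis it is written in, while a purely linear read/write of the residual stream does not.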

~37:50 There is a spectrum of how privileged a basis is, rather than a binary of privileged vs. non-privileged.

(the truth is there are traces of various privileges in the residual stream as well)

~39:30 Even Adam privileges everything it interacts with, because of its weirdness ("Adam sucks", says Neel Nanda, but I don't think that's necessarily so; perhaps this artificial thing is good, who knows(!)).
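One way to see the basis dependence is to check whether an optimizer step commutes with a rotation of the gradient. This is a minimal sketch under my own assumptions: a 2D parameter vector, only the very first Adam step from zero-initialized moments, and standard default hyperparameters. Plain SGD passes the check; Adam does not, because it rescales each coordinate separately.

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def sgd_step(g, lr=0.1):
    """Plain SGD update for gradient g."""
    return [-lr * a for a in g]

def adam_first_step(g, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update, starting from zero first and second moments."""
    m = [(1 - beta1) * a for a in g]
    v = [(1 - beta2) * a * a for a in g]
    m_hat = [a / (1 - beta1) for a in m]   # bias correction at t = 1
    v_hat = [a / (1 - beta2) for a in v]
    return [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

def equivariance_gap(step, g, theta):
    """How much 'rotate then step' differs from 'step then rotate'."""
    a = step(rotate(g, theta))
    b = rotate(step(g), theta)
    return max(abs(p - q) for p, q in zip(a, b))

g = [1.0, 0.0]
theta = math.pi / 4
sgd_gap = equivariance_gap(sgd_step, g, theta)           # essentially zero
adam_gap = equivariance_gap(adam_first_step, g, theta)   # clearly non-zero
```

The first Adam step is roughly `-lr * sign(g)` per coordinate, so it depends on which axes the gradient is expressed in. Whether that is a bug or a feature is a separate question.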

Edited 2023-10-29 18:26 (UTC)

~43:00 Virtual weights: these are not really between layers, they are between attention heads (just look at how, if at all, the input of a particular attention head takes the output of another particular attention head into account).

That gives a crude proxy for what's going on.
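A minimal sketch of the idea, with made-up shapes and numbers: `W_out_head1` is a hypothetical matrix that writes one head's output into the residual stream, and `W_in_head2` is a hypothetical matrix with which a later head reads from it. Their product is a "virtual weight" that connects the two heads directly, and its size (here the Frobenius norm, which is my choice of crude score) hints at how strongly the later head listens to the earlier one.

```python
import math

def matmul(a, b):
    """Matrix product of a (n x k) and b (k x m), both as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Hypothetical toy sizes: d_model = 3, d_head = 2.
W_out_head1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_model x d_head
W_in_head2 = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]     # d_head x d_model

# Virtual weight: how head 2's input depends on head 1's internal output.
virtual = matmul(W_in_head2, W_out_head1)            # d_head x d_head

z = [3.0, 4.0]                                       # head 1's internal output
via_stream = matvec(W_in_head2, matvec(W_out_head1, z))
via_virtual = matvec(virtual, z)                     # same result, one matrix

# Crude interaction score (an assumption, not a canonical definition).
frobenius = math.sqrt(sum(a * a for row in virtual for a in row))
```

A virtual weight close to zero would suggest that the later head mostly ignores what the earlier one wrote.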

Edited 2023-10-29 18:45 (UTC)

~45:00 There is a weird ambiguity here: when people talk about the dimensionality of the residual stream, they mention only the embedding dimension and omit the context-length dimension. I can imagine this biting us in many ways (not only by causing confusion, but by leading to wrong conclusions).

I just double-checked his remark about the MLP having 4 times as many neurons as the embedding dimension against https://github.com/karpathy/minGPT/blob/master/mingpt/model.py and yes, that is the case there.

(But we still need to see how this works with context length. It's not very transparent in the code, which is inconvenient; in the MLP it is even less transparent than in the attention layer, where they have to write it out explicitly in connection with splitting into attention heads.)
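To make the context-length dimension explicit, here is a sketch of my own (not minGPT's code) that treats the residual stream as a list of T vectors of size `D_MODEL` and applies one shared MLP to each position separately. The assumptions: random toy weights, no biases, and the tanh approximation of GELU. The hidden layer is 4 times as wide as the embedding, and changing one position leaves the MLP output at every other position untouched.

```python
import math
import random

D_MODEL = 2
D_HIDDEN = 4 * D_MODEL   # the "4 times more neurons" ratio

random.seed(0)
# Hypothetical random weights, shared across all positions.
W_in = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_HIDDEN)]
W_out = [[random.gauss(0, 1) for _ in range(D_HIDDEN)] for _ in range(D_MODEL)]

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mlp_one_position(x):
    """d_model -> 4 * d_model -> d_model for a single position."""
    hidden = [gelu(h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

def mlp(stream):
    """Apply the same MLP independently at each of the T positions."""
    return [mlp_one_position(x) for x in stream]

stream = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]   # T = 3 positions
out = mlp(stream)

changed = [row[:] for row in stream]
changed[2] = [9.0, 9.0]                          # perturb the last position only
out_changed = mlp(changed)
```

So, in this picture, the MLP never mixes information across positions; only attention does. A framework that applies a linear layer to the last dimension of a (batch, T, d_model) tensor does the same thing, just without the loop over positions being visible.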

~53:50 Transformers figure out how to clean up unnecessary leftovers from the residual stream.

~56:00 He speculates that embedding and unembedding use only a fraction of the residual stream's dimensionality.
