Entry tags: Let's understand Large Language Models better
This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
That gives some crude proxy for what's going on inside these models.
(But we still need to see how this interacts with context length; that's not very transparent in the code, which is inconvenient. In the MLP layer it is even less transparent than in the attention layer, where the sequence dimension has to be written out explicitly because of the split into attention heads.)
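A minimal PyTorch sketch of why that is (mine, not from the entry; the names and sizes are illustrative, assuming a standard GPT-style decoder block). The attention layer has to manipulate the sequence axis explicitly when splitting into heads, so the context length shows up in the [seq, seq] score matrix and the causal mask; the MLP is purely position-wise, so the sequence axis is just another batch dimension and context length is invisible in its code.

    # Sketch of one GPT-style block, assuming d_model = n_heads * d_head.
    import torch
    import torch.nn.functional as F

    batch, seq, d_model, n_heads = 2, 16, 64, 4
    d_head = d_model // n_heads
    x = torch.randn(batch, seq, d_model)

    # Attention: splitting into heads forces explicit handling of seq.
    qkv = torch.nn.Linear(d_model, 3 * d_model)(x)           # [batch, seq, 3*d_model]
    q, k, v = qkv.chunk(3, dim=-1)
    q = q.view(batch, seq, n_heads, d_head).transpose(1, 2)  # [batch, heads, seq, d_head]
    k = k.view(batch, seq, n_heads, d_head).transpose(1, 2)
    v = v.view(batch, seq, n_heads, d_head).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / d_head**0.5           # [batch, heads, seq, seq]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))         # causal mask over the context
    attn = F.softmax(scores, dim=-1) @ v                     # [batch, heads, seq, d_head]
    attn = attn.transpose(1, 2).reshape(batch, seq, d_model)

    # MLP: applied independently at every position; seq never appears,
    # which is why context length is so untransparent in this part.
    mlp = torch.nn.Sequential(
        torch.nn.Linear(d_model, 4 * d_model),
        torch.nn.GELU(),
        torch.nn.Linear(4 * d_model, d_model),
    )
    out = mlp(attn)                                          # [batch, seq, d_model]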