This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
no subject
Date: 2023-10-29 07:01 pm (UTC)

(But we still need to see how this interacts with the context length; that is not very transparent in the code, which is inconvenient. In the MLP it is even less transparent than in the attention layer, where the sequence dimension has to be handled explicitly because of the split into attention heads.)
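A minimal PyTorch sketch of the generic pattern (an illustration with assumed toy shapes, not the specific code discussed above): in the MLP block the context length never appears explicitly, since nn.Linear acts only on the last dimension, while the attention code has to reshape over seq_len to split into heads and to form the seq_len-by-seq_len attention pattern.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions, for illustration only.
d_model, n_heads, d_head = 64, 4, 16
batch, seq_len = 2, 10  # seq_len is the context length

x = torch.randn(batch, seq_len, d_model)

# MLP block: nn.Linear acts on the last dimension only, so the context
# length never shows up in the code; every position is processed
# independently and seq_len just rides along as a batch-like axis.
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)
mlp_out = mlp(x)  # (batch, seq_len, d_model)

# Attention block: here seq_len must be handled explicitly, because the
# split into heads reshapes the tensor and the attention pattern is a
# (seq_len, seq_len) matrix per head.
W_q = nn.Linear(d_model, n_heads * d_head)
W_k = nn.Linear(d_model, n_heads * d_head)
W_v = nn.Linear(d_model, n_heads * d_head)

def split_heads(t):
    # (batch, seq_len, n_heads * d_head) -> (batch, n_heads, seq_len, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (batch, n_heads, seq_len, seq_len)
pattern = scores.softmax(dim=-1)                  # mixes information across positions
attn_out = pattern @ v                            # (batch, n_heads, seq_len, d_head)
```

Note how the MLP path never references seq_len at all, while the attention path cannot avoid it: both the head split and the attention pattern are defined over the sequence dimension.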