Entry tags: Let's understand Large Language Models better
This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
That gives some crude proxy for what's going on inside these models.
(But we still need to see how this interacts with context length; that's not very transparent in the code, which is inconvenient. In the MLP layer it is even less transparent than in the attention layer, where the sequence dimension has to be written out explicitly because of the split into attention heads.)
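A minimal PyTorch sketch of why that is (mine, not from the entry; the names and sizes are illustrative, assuming a standard GPT-style decoder block). The attention layer has to manipulate the sequence axis explicitly when splitting into heads, so the context length shows up in the [seq, seq] score matrix and the causal mask; the MLP is purely position-wise, so the sequence axis is just another batch dimension and context length is invisible in its code.

    # Sketch of one GPT-style block, assuming d_model = n_heads * d_head.
    import torch
    import torch.nn.functional as F

    batch, seq, d_model, n_heads = 2, 16, 64, 4
    d_head = d_model // n_heads
    x = torch.randn(batch, seq, d_model)

    # Attention: splitting into heads forces explicit handling of seq.
    qkv = torch.nn.Linear(d_model, 3 * d_model)(x)           # [batch, seq, 3*d_model]
    q, k, v = qkv.chunk(3, dim=-1)
    q = q.view(batch, seq, n_heads, d_head).transpose(1, 2)  # [batch, heads, seq, d_head]
    k = k.view(batch, seq, n_heads, d_head).transpose(1, 2)
    v = v.view(batch, seq, n_heads, d_head).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / d_head**0.5           # [batch, heads, seq, seq]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))         # causal mask over the context
    attn = F.softmax(scores, dim=-1) @ v                     # [batch, heads, seq, d_head]
    attn = attn.transpose(1, 2).reshape(batch, seq, d_model)

    # MLP: applied independently at every position; seq never appears,
    # which is why context length is so untransparent in this part.
    mlp = torch.nn.Sequential(
        torch.nn.Linear(d_model, 4 * d_model),
        torch.nn.GELU(),
        torch.nn.Linear(4 * d_model, d_model),
    )
    out = mlp(attn)                                          # [batch, seq, d_model]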