This is a good starting point:
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
"A Mathematical Framework for Transformer Circuits", Dec 2021
transformer-circuits.pub/2021/framework/index.html
no subject
Date: 2023-10-29 07:01 pm (UTC)

(But we still need to see how this interacts with the context length; that is not very transparent in the code, which is inconvenient. In the MLP it is even less transparent than in the attention layer, where the sequence dimension has to be handled explicitly because of the split into attention heads.)
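A minimal PyTorch sketch of the generic pattern (an illustration with assumed toy shapes, not the specific code discussed above): in the MLP block the context length never appears explicitly, since nn.Linear acts only on the last dimension, while the attention code has to reshape over seq_len to split into heads and to form the seq_len-by-seq_len attention pattern.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions, for illustration only.
d_model, n_heads, d_head = 64, 4, 16
batch, seq_len = 2, 10  # seq_len is the context length

x = torch.randn(batch, seq_len, d_model)

# MLP block: nn.Linear acts on the last dimension only, so the context
# length never shows up in the code; every position is processed
# independently and seq_len just rides along as a batch-like axis.
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)
mlp_out = mlp(x)  # (batch, seq_len, d_model)

# Attention block: here seq_len must be handled explicitly, because the
# split into heads reshapes the tensor and the attention pattern is a
# (seq_len, seq_len) matrix per head.
W_q = nn.Linear(d_model, n_heads * d_head)
W_k = nn.Linear(d_model, n_heads * d_head)
W_v = nn.Linear(d_model, n_heads * d_head)

def split_heads(t):
    # (batch, seq_len, n_heads * d_head) -> (batch, n_heads, seq_len, d_head)
    return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (batch, n_heads, seq_len, seq_len)
pattern = scores.softmax(dim=-1)                  # mixes information across positions
attn_out = pattern @ v                            # (batch, n_heads, seq_len, d_head)
```

Note how the MLP path never references seq_len at all, while the attention path cannot avoid it: both the head split and the attention pattern are defined over the sequence dimension.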