Date: 2023-10-30 02:12 am (UTC)
From: [personal profile] dmm
Moving toward an understanding of attention heads...

1) Interestingly, what they seem to be saying is that the split into attention heads is not just an efficiency device but is semantically meaningful. It would be interesting to experiment with very small dimensions for attention heads, perhaps even as small as 1 (and also 2, etc.); a sketch of such an experiment is below.
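A minimal sketch of what such an experiment could look like (my own illustrative code, not from the post or the paper; all names such as n_heads and head_dim are assumptions): multi-head self-attention with the per-head dimension exposed as a parameter, so one can set it to 1 or 2 and see that the computation still goes through.

```python
# Sketch: multi-head self-attention with a configurable per-head dimension.
# Random projections stand in for learned weights; this only demonstrates
# that head_dim = 1 is a well-formed choice, not that it trains well.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, head_dim, rng):
    """x: (seq_len, d_model). Each head projects down to head_dim dimensions."""
    seq_len, d_model = x.shape
    out = np.zeros_like(x)
    for _ in range(n_heads):
        W_q = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)
        W_o = rng.standard_normal((head_dim, d_model)) / np.sqrt(head_dim)
        q, k, v = x @ W_q, x @ W_k, x @ W_v        # each (seq_len, head_dim)
        A = softmax(q @ k.T / np.sqrt(head_dim))   # (seq_len, seq_len) pattern
        out += A @ v @ W_o                         # heads write back additively
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
y = multi_head_attention(x, n_heads=16, head_dim=1, rng=rng)
print(y.shape)  # (5, 16)
```

Note that with head_dim = 1 each head's queries and keys are single scalars per position, so its pre-softmax score matrix has rank one; that extreme simplicity is part of what might make such heads interesting to inspect.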

2) Interestingly, Neel Nanda thinks that using the tensor product formalism is a methodological mistake (it certainly makes the material more difficult to understand, but perhaps it enables more powerful ways of thinking; in any case, this use of tensor products is optional, as the identity sketched below shows).
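A hedged sketch of why the tensor products are optional (my own notation and conventions, not necessarily the paper's): if the residual stream $x$ is an $n \times d$ matrix whose rows are token positions, $A$ is a head's $n \times n$ attention pattern, and $W_{OV}$ is its $d \times d$ output-value map acting on each position's embedding vector, then the tensor-product expression unfolds into ordinary matrix algebra.

```latex
% Assumed conventions: rows of x are token positions; W_OV acts on each
% row's embedding vector, taken as a column vector.
\[
  h(x) \;=\; (A \otimes W_{OV}) \cdot x \;=\; A \, x \, W_{OV}^{\top}
\]
% A mixes rows (positions) and W_OV acts within each row (embedding
% coordinates), so the Kronecker-product notation can be replaced by
% two ordinary matrix products.
```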