Date: 2023-10-30 02:12 am (UTC)

1) Interestingly, they seem to say that splitting attention into heads is not just an efficiency device but is semantically meaningful. It would be interesting to experiment with very small attention-head dimensions, perhaps as small as 1 or 2.
2) Neel Nanda thinks that using the tensor product formalism is a methodological mistake. It certainly makes the material harder to understand, though it might enable more powerful ways of thinking. In any case, this use of tensor products is at least optional.
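
As a rough illustration of both points, here is a minimal sketch in plain Python; it is not taken from the material under discussion. It assumes a single unmasked attention head with hypothetical weight names w_q, w_k, w_v and w_o, sets the head dimension to 1, and checks numerically that the tensor (Kronecker) product form, (A ⊗ W_O W_V) applied to the flattened input, agrees with the ordinary matrix form A X W_V^T W_O^T. The sizes, the random weights, and the omission of causal masking are assumptions made for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def kron(a, b):
    """Kronecker product: entry ((i, k), (j, l)) equals a[i][j] * b[k][l]."""
    return [[ai * bj for ai in ra for bj in rb] for ra in a for rb in b]


def head_output(x, w_q, w_k, w_v, w_o):
    """One unmasked attention head (no causal mask: an assumption of this sketch).

    x: n x d_model; w_q, w_k, w_v: d_head x d_model; w_o: d_model x d_head.
    Returns the attention pattern A (n x n) and the head output (n x d_model).
    """
    d_head = len(w_q)
    q = matmul(x, transpose(w_q))  # n x d_head
    k = matmul(x, transpose(w_k))  # n x d_head
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, transpose(k))]
    a = softmax_rows(scores)       # n x n
    v = matmul(x, transpose(w_v))  # n x d_head
    return a, matmul(matmul(a, v), transpose(w_o))


# Illustrative sizes only; d_head = 1 is the extreme case mentioned in point 1.
rng = random.Random(0)
n, d_model, d_head = 3, 4, 1
x = rand_matrix(n, d_model, rng)
w_q, w_k, w_v = (rand_matrix(d_head, d_model, rng) for _ in range(3))
w_o = rand_matrix(d_model, d_head, rng)

# Ordinary matrix form: A X W_V^T W_O^T.
a, direct = head_output(x, w_q, w_k, w_v, w_o)

# Tensor product form: (A kron W_O W_V) applied to the row-major flattening of x.
m = matmul(w_o, w_v)                      # d_model x d_model, rank <= d_head
flat_x = [[v] for row in x for v in row]  # column vector of length n * d_model
flat_out = matmul(kron(a, m), flat_x)
via_kron = [[flat_out[i * d_model + k][0] for k in range(d_model)] for i in range(n)]

max_diff = max(abs(p - q) for rp, rq in zip(direct, via_kron) for p, q in zip(rp, rq))
print("max abs difference between the two forms:", max_diff)
```

With d_head = 1 the matrix W_O W_V has rank at most 1, so such a head can only write along a single direction in the model space. That the two computations agree is what makes the tensor product notation optional: it repackages the same matrix products.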