Dataflow matrix machines (by Anhinga anhinga) ([personal profile] dmm) wrote 2023-10-30 05:04 am (UTC)

~1:17:30: attention does not only move information from the residual stream of one token to that of another; information accumulated from the residual streams of many tokens can then be moved again (combining aggregation and compositionality).
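A toy sketch of that point, not taken from the talk: everything here (the `attention_layer` helper, the hand-written attention patterns, the one-hot residual vectors) is a hypothetical simplification. It assumes fixed attention weights instead of softmax over query-key scores, and an identity value/output map, so that it is easy to see which token's information ends up where.

```python
def attention_layer(residual, pattern):
    """Add to each token's residual stream a weighted sum of the streams it attends to.

    residual: list of per-token vectors (lists of floats)
    pattern:  pattern[dst][src] is the attention weight token dst puts on token src
              (hand-written here; a real model computes it from queries and keys)
    """
    n = len(residual)
    dim = len(residual[0])
    updated = []
    for dst in range(n):
        # Information moved into dst from every source token (identity OV map assumed).
        moved = [
            sum(pattern[dst][src] * residual[src][d] for src in range(n))
            for d in range(dim)
        ]
        # Residual connection: the moved information is added, not substituted.
        updated.append([residual[dst][d] + moved[d] for d in range(dim)])
    return updated


# Four tokens; each starts with a one-hot vector marking "its own" information.
residual0 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Layer 1: token 2 aggregates from tokens 0 and 1.
layer1 = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0.5, 0.5, 0, 0],
    [0, 0, 0, 0],
]

# Layer 2: token 3 attends only to token 2.
layer2 = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1.0, 0],
]

residual1 = attention_layer(residual0, layer1)
residual2 = attention_layer(residual1, layer2)

print(residual1[2])  # token 2 after aggregation
print(residual2[3])  # token 3 after the second move
```

In this sketch token 3 never attends to tokens 0 or 1 directly, yet after the second layer its residual stream carries their components, because it reads the aggregate that token 2 built in the first layer.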
