Some new papers
Jun. 23rd, 2021 11:12 am"Aurochs: An Architecture for Dataflow Threads", by a team from Stanford.
They say they have learned to do dataflow-style acceleration for hash tables, B-trees, and similar structures. This might be an answer to my long-standing desire to have good parallelization for less regular and less uniform computations. And the best thing is that this is a software solution: one does not have to build specialized processors to take advantage of it:
conferences.computer.org/iscapub/pdfs/ISCA2021-4ghucdBnCWYB7ES2Pe4YdT/333300a402/333300a402.pdf
"Thinking Like Transformers", by a team from Israel
"What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language."
arxiv.org/abs/2106.06981
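For intuition, here is a toy Python re-implementation of the paper's two core primitives as I understand them (my own sketch, not the official RASP code): select builds a boolean attention pattern by comparing key and query positions with a predicate, and aggregate averages the selected values, roughly what a uniform-attention head does.

def select(keys, queries, predicate):
    # sel[q][k] is True when position q should attend to position k
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(sel, values, default=0.0):
    # uniform average of the values at the selected positions
    out = []
    for row in sel:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

tokens = [3.0, 1.0, 4.0, 1.0, 5.0]
indices = list(range(len(tokens)))

# reverse the sequence: position q attends to position len-1-q
flip = select(indices, indices, lambda k, q: k == len(tokens) - 1 - q)
print(aggregate(flip, tokens))    # [5.0, 1.0, 4.0, 1.0, 3.0]

# running mean of the prefix (causal uniform attention)
prefix = select(indices, indices, lambda k, q: k <= q)
print(aggregate(prefix, tokens))  # [3.0, 2.0, 2.66..., 2.25, 2.8]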
So much is happening, and suddenly there is much less time; I cannot manage to read everything I usually read... About ten days ago everything changed rather abruptly, the dynamics became quite different, feels like a transition period...
no subject
Date: 2021-06-23 03:26 pm (UTC)

no subject
Date: 2021-06-23 05:51 pm (UTC)

"Attention is Turing-Complete", https://jmlr.org/papers/v22/20-302.html, by a team from Chile,
which should not be too different from
"On the Turing Completeness of Modern Neural Network Architectures": https://arxiv.org/abs/1901.03429 and https://openreview.net/forum?id=HyGBdo0qFm (ICLR 2019 poster)
and among other things it says "We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete." This needs to be looked at more closely, since we know that Turing completeness requires the capability to increase the memory footprint in an unlimited fashion; presumably the unbounded storage is hidden in the assumption of arbitrary-precision internal representations.
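A toy illustration of that kind of trick (in the spirit of the old Siegelmann-Sontag results for RNNs, not the actual construction in these papers): a single exact rational in [0, 1) can serve as an unbounded stack of bits.

from fractions import Fraction

def push(x, bit):
    # push a bit onto the stack encoded in x (exact, no precision loss)
    return (x + bit) / 2

def top(x):
    # read the most recently pushed bit
    return 1 if x >= Fraction(1, 2) else 0

def pop(x):
    # remove the most recently pushed bit
    return 2 * x - top(x)

x = Fraction(0)
for b in [1, 0, 1, 1]:
    x = push(x, b)

bits = []
for _ in range(4):
    bits.append(top(x))
    x = pop(x)
print(bits)  # [1, 1, 0, 1] -- last pushed comes out first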
In any case, this might be quite worthwhile to study more closely...
no subject
Date: 2021-06-23 05:58 pm (UTC)

https://github.com/tech-srl/RASP
no subject
Date: 2021-06-23 06:00 pm (UTC)

(Code for running the transformers in the ICML 2021 paper "Thinking Like Transformers")
no subject
Date: 2021-06-23 06:06 pm (UTC)

"This particular transformer was trained using both target and attention supervision, i.e.: in addition to the standard cross entropy loss on the target output, the model was given an MSE-loss on the difference between its attention heatmaps and those expected by the RASP solution. The transformer reached test accuracy of 99.9% on the task, and comparing the selection patterns in (b) with the heatmaps in (c) suggests that it has also successfully learned to replicate the solution described in (a)."
This suggests that we are talking about program inference here (and not necessarily about program synthesis; although they do discuss work on distilling automata from RNNs, so it might well be possible to distill RASP programs from Transformers)...
Also, of course, it is possible that Transformers can do more than what is described in this paper...
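Just to make the training setup from that quote concrete, a hedged PyTorch-style sketch (the names and shapes are my guesses, not the paper's code): cross-entropy on the target output plus an MSE term pulling the model's attention heatmaps towards the ones expected by the RASP solution.

import torch.nn.functional as F

def combined_loss(logits, targets, attn_maps, rasp_heatmaps, attn_weight=1.0):
    # logits:        (batch, seq, vocab)      model output scores
    # targets:       (batch, seq)             gold output tokens
    # attn_maps:     (batch, heads, seq, seq) model attention patterns
    # rasp_heatmaps: (batch, heads, seq, seq) patterns expected by the RASP solution
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    mse = F.mse_loss(attn_maps, rasp_heatmaps)
    return ce + attn_weight * mse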
no subject
Date: 2021-06-23 06:30 pm (UTC)

It was rejected from ICLR 2021 (which does not mean much, since there is almost no correlation between the actual quality of papers and reviewer evaluations, if we believe the study recently posted by Scott Alexander): https://openreview.net/forum?id=TmkN9JmDJx1
discussed here: https://news.ycombinator.com/item?id=27528004
and here: https://www.reddit.com/r/mlscaling/comments/o1jkki/thinking_like_transformers/
no subject
Date: 2021-06-23 06:32 pm (UTC)

"... functions rather than sequences."
I am not sure if this matters (or if this is good - neural models tend to be eager, not lazy; or, at least, this is how I tend to think about them: data are pushed as a flow, not pulled).
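For what I mean by eager vs lazy here, a trivial Python contrast (nothing to do with RASP's actual implementation): an eager pipeline materializes every intermediate sequence, a lazy one only computes values when they are pulled.

def eager_pipeline(xs):
    # push/eager style: every stage materializes the whole sequence
    doubled = [x * 2 for x in xs]
    return [x for x in doubled if x > 0]

def lazy_pipeline(xs):
    # pull/lazy style: values are computed only when demanded
    doubled = (x * 2 for x in xs)
    return (x for x in doubled if x > 0)

data = [-2, 1, 3, -1]
print(eager_pipeline(data))       # [2, 6] -- all the work is done immediately
print(next(lazy_pipeline(data)))  # 2 -- only enough work to produce one value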
no subject
Date: 2021-06-23 06:48 pm (UTC)

no subject
Date: 2021-06-23 06:55 pm (UTC)

no subject
Date: 2021-06-23 07:05 pm (UTC)

no subject
Date: 2021-06-23 07:07 pm (UTC)

("Improving Transformer Models by Reordering their Sublayers", https://www.aclweb.org/anthology/2020.acl-main.270/ and https://arxiv.org/abs/1911.03864)
no subject
Date: 2021-06-23 07:08 pm (UTC)