Some new papers
Jun. 23rd, 2021 11:12 am"Aurochs: An Architecture for Dataflow Threads", by a team from Stanford.
They say they have learned to do dataflow-style acceleration for hash tables, B-trees, and similar structures. This might be an answer to my long-standing desire to have good parallelization for less regular and less uniform computations. And the best thing is that this is a software solution: one does not have to build specialized processors to take advantage of it:
conferences.computer.org/iscapub/pdfs/ISCA2021-4ghucdBnCWYB7ES2Pe4YdT/333300a402/333300a402.pdf
"Thinking Like Transformers", by a team from Israel
"What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language."
arxiv.org/abs/2106.06981
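For intuition, here is a toy Python re-implementation of the paper's two core primitives as I understand them (my own sketch, not the official RASP code): select builds a boolean attention pattern by comparing key and query positions with a predicate, and aggregate averages the selected values, roughly what a uniform-attention head does.

def select(keys, queries, predicate):
    # sel[q][k] is True when position q should attend to position k
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(sel, values, default=0.0):
    # uniform average of the values at the selected positions
    out = []
    for row in sel:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

tokens = [3.0, 1.0, 4.0, 1.0, 5.0]
indices = list(range(len(tokens)))

# reverse the sequence: position q attends to position len-1-q
flip = select(indices, indices, lambda k, q: k == len(tokens) - 1 - q)
print(aggregate(flip, tokens))    # [5.0, 1.0, 4.0, 1.0, 3.0]

# running mean of the prefix (causal uniform attention)
prefix = select(indices, indices, lambda k, q: k <= q)
print(aggregate(prefix, tokens))  # [3.0, 2.0, 2.66..., 2.25, 2.8]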
So much is happening, and suddenly there is much less time; I cannot manage to read everything I usually read... About ten days ago everything changed rather abruptly, the dynamics became quite different, feels like a transition period...
no subject
Date: 2021-06-23 03:26 pm (UTC)

no subject
Date: 2021-06-23 05:51 pm (UTC)

"Attention is Turing-Complete", https://jmlr.org/papers/v22/20-302.html, by a team from Chile,
which should not be too different from
"On the Turing Completeness of Modern Neural Network Architectures": https://arxiv.org/abs/1901.03429 and https://openreview.net/forum?id=HyGBdo0qFm (ICLR 2019 poster)
and among other things it says "We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete." This needs to be looked at more closely, since we know that Turing completeness requires the capability to increase the memory footprint in an unlimited fashion; presumably the unbounded storage is hidden in the assumption of arbitrary-precision internal representations.
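A toy illustration of that kind of trick (in the spirit of the old Siegelmann-Sontag results for RNNs, not the actual construction in these papers): a single exact rational in [0, 1) can serve as an unbounded stack of bits.

from fractions import Fraction

def push(x, bit):
    # push a bit onto the stack encoded in x (exact, no precision loss)
    return (x + bit) / 2

def top(x):
    # read the most recently pushed bit
    return 1 if x >= Fraction(1, 2) else 0

def pop(x):
    # remove the most recently pushed bit
    return 2 * x - top(x)

x = Fraction(0)
for b in [1, 0, 1, 1]:
    x = push(x, b)

bits = []
for _ in range(4):
    bits.append(top(x))
    x = pop(x)
print(bits)  # [1, 1, 0, 1] -- last pushed comes out first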
In any case, this might be quite worthwhile to study more closely...
no subject
Date: 2021-06-23 05:58 pm (UTC)

https://github.com/tech-srl/RASP
no subject
Date: 2021-06-23 06:00 pm (UTC)

(Code for running the transformers in the ICML 2021 paper "Thinking Like Transformers")
no subject
Date: 2021-06-23 06:06 pm (UTC)

"This particular transformer was trained using both target and attention supervision, i.e.: in addition to the standard cross entropy loss on the target output, the model was given an MSE-loss on the difference between its attention heatmaps and those expected by the RASP solution. The transformer reached test accuracy of 99.9% on the task, and comparing the selection patterns in (b) with the heatmaps in (c) suggests that it has also successfully learned to replicate the solution described in (a)."
This suggests that we are talking about program inference here (and not necessarily about program synthesis; although they do discuss work on distilling automata from RNNs, so it might well be possible to distill RASP programs from Transformers)...
Also, of course, it is possible that Transformers can do more than what is described in this paper...
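Just to make the training setup from that quote concrete, a hedged PyTorch-style sketch (the names and shapes are my guesses, not the paper's code): cross-entropy on the target output plus an MSE term pulling the model's attention heatmaps towards the ones expected by the RASP solution.

import torch.nn.functional as F

def combined_loss(logits, targets, attn_maps, rasp_heatmaps, attn_weight=1.0):
    # logits:        (batch, seq, vocab)      model output scores
    # targets:       (batch, seq)             gold output tokens
    # attn_maps:     (batch, heads, seq, seq) model attention patterns
    # rasp_heatmaps: (batch, heads, seq, seq) patterns expected by the RASP solution
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    mse = F.mse_loss(attn_maps, rasp_heatmaps)
    return ce + attn_weight * mse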
no subject
Date: 2021-06-23 06:30 pm (UTC)

It was rejected from ICLR 2021 (which does not mean much, since there is almost no correlation between the actual quality of papers and reviewer evaluations, if we believe the study recently posted by Scott Alexander): https://openreview.net/forum?id=TmkN9JmDJx1
discussed here: https://news.ycombinator.com/item?id=27528004
and here: https://www.reddit.com/r/mlscaling/comments/o1jkki/thinking_like_transformers/
no subject
Date: 2021-06-23 06:32 pm (UTC)

"... functions rather than sequences."
I am not sure if this matters (or if this is good - neural models tend to be eager, not lazy; or, at least, this is how I tend to think about them: data are pushed as a flow, not pulled).
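For what I mean by eager vs lazy here, a trivial Python contrast (nothing to do with RASP's actual implementation): an eager pipeline materializes every intermediate sequence, a lazy one only computes values when they are pulled.

def eager_pipeline(xs):
    # push/eager style: every stage materializes the whole sequence
    doubled = [x * 2 for x in xs]
    return [x for x in doubled if x > 0]

def lazy_pipeline(xs):
    # pull/lazy style: values are computed only when demanded
    doubled = (x * 2 for x in xs)
    return (x for x in doubled if x > 0)

data = [-2, 1, 3, -1]
print(eager_pipeline(data))       # [2, 6] -- all the work is done immediately
print(next(lazy_pipeline(data)))  # 2 -- only enough work to produce one value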
no subject
Date: 2021-06-23 06:48 pm (UTC)

no subject
Date: 2021-06-23 06:55 pm (UTC)

no subject
Date: 2021-06-23 07:05 pm (UTC)

no subject
Date: 2021-06-23 07:07 pm (UTC)

("Improving Transformer Models by Reordering their Sublayers", https://www.aclweb.org/anthology/2020.acl-main.270/ and https://arxiv.org/abs/1911.03864)
no subject
Date: 2021-06-23 07:08 pm (UTC)