Towards the 9-month mark since the appearance of GPT-3
Mar. 1st, 2021 02:03 am
My professional activity over the last nine months has been entirely colored by the breakthrough that came with the creation of GPT-3 and by the discovery that this thing already has quite magical properties.
Here, in the comments, I want to trace how it all went and what I tried to do about it along the way (including on GitHub).
The revolution caused, or at least radically accelerated, by the appearance of GPT-3 and the work that followed is in full swing, and I am not sure anyone manages to keep track of all the important developments in this area. I am not attempting a survey; this is, rather, an attempt to recall my personal trajectory.
no subject
Date: 2021-03-01 07:25 am (UTC)
The demo turned out to be quite real; since then, many people have built various similar things on top of GPT-3.
no subject
Date: 2021-03-01 07:27 am (UTC)
no subject
Date: 2021-03-01 07:36 am (UTC)
It's really difficult to get access, but a number of people do, and it soon becomes quite clear that the system is everything it is advertised to be, and more. (One needs to "play with it" to make it do what one wants, but it's doable.)
no subject
Date: 2021-03-01 07:41 am (UTC)
It gradually dawned on me that the future belongs to hybrids of the most diverse methods with Transformers and other attention-based models.
***
'Apparently, a period of hybridization is inevitable next: all kinds of infusions of hierarchical-attention motifs into all sorts of situations and models that people work with. (Generally speaking, they do not even always have to be so huge; I think there is room for small, elegant modifications of the idea of hierarchical attention as well.)'
'One of the strange thoughts that occurred to me recently in this connection is that there is a lot in common, in spirit, between the hierarchical attention structures in these models and the regulation of gene expression in cells.'
'By the way, a group from Salesforce Research showed that "the Transformer's attention mechanism recovers high-level structural (folding) and functional properties of proteins", in addition to all the other magic achievable with this class of models.'
no subject
Date: 2021-03-01 07:48 am (UTC)
no subject
Date: 2021-03-01 07:52 am (UTC)
https://www.cs.brandeis.edu/~bukatin/transformer_revolution.html
no subject
Date: 2021-03-01 07:56 am (UTC)
This is still an active project with commits happening from time to time.
no subject
Date: 2021-03-01 08:00 am (UTC)
I am explaining the intuition behind "content-based attention": why it is natural to think of a linear combination as "attention". (I did not understand this before; this explanation is the result of my new focus on these issues.)
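For what it's worth, here is a minimal self-contained sketch of that intuition in Julia (with made-up numbers, not the actual text of my notes): a query scores each memory vector by a dot product, the scores go through a softmax, and the output is just the resulting linear combination of the memory vectors.

```julia
# Content-based attention as a linear combination (toy illustration):
# attention weights come from similarities, and the output is the weighted
# sum (i.e., a linear combination) of the memory vectors.
softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

q = [1.0, 0.0, 2.0]                # a query vector
M = [0.5  0.1  2.0;                # each row of M is one memory vector
     1.0  0.0  0.0;
     0.0  1.0  1.0]

scores  = M * q                    # dot-product similarity of q with each row
weights = softmax(scores)          # attention weights, summing to 1
output  = M' * weights             # linear combination of the memory vectors
```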
no subject
Date: 2021-03-01 08:03 am (UTC)
This explains how attention is used in Transformers.
Further updated on August 5.
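As a reminder of the mechanism that note is about, here is a minimal Julia sketch (single head, no masking, illustrative shapes only) of the scaled dot-product attention used in Transformers:

```julia
# Scaled dot-product attention (one head, no mask). Rows of Q, K, V are token
# vectors; each output row is a linear combination of the rows of V.
softmax_rows(A) = mapslices(v -> exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v))), A; dims=2)

attention(Q, K, V) = softmax_rows(Q * K' ./ sqrt(size(K, 2))) * V

Q, K, V = rand(4, 8), rand(4, 8), rand(4, 8)   # 4 tokens, dimension 8
attention(Q, K, V)                             # 4×8 matrix of attended values
```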
no subject
Date: 2021-03-01 08:05 am (UTC)
Further updated on July 25.
no subject
Date: 2021-03-01 08:15 am (UTC)
Started making notes on possible attention-related experiments:
1)"Adding feedback to Transformers"
2) "DMM neural achitecture search" (здесь, кажется, впервые в явном виде возникают наблюдения о тесных связях между DMMs и Transformers: и те и другие основаны на линейных комбинациях, и объединение многих линейных комбинаций в матричное умножение тоже является важным мотивом и там и там.
На самом деле, эти тесные связи и являются ключевым мотивом для моего продолжающегося внимания к теме attention and Transformers - всё это гораздо ближе к тому, чем я и так занимаюсь, чем очевидно на первый взгляд. И есть надежда на разнообразное плодотворное взаимодействие между этими темами.)
July 28: added "Semantic grounding experiments"
July 31: added "Higher-order attention" (another section which is closely related to DMMs and which inspired quite a bit of what came later)
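To make the "shared motif" remark in item 2 concrete, here is a toy Julia sketch (illustrative numbers only) of how many linear combinations of the same vectors collapse into a single matrix multiplication:

```julia
# Many linear combinations of the same set of vectors, packed into one matmul.
V = rand(5, 3)     # 5 vectors of dimension 3 (rows of V)
W = rand(10, 5)    # 10 sets of mixing coefficients (rows of W)
C = W * V          # row i of C is the linear combination with coefficients W[i, :]
```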
no subject
Date: 2021-03-01 08:21 am (UTC)
I was not aware of this page until recently: https://en.wikipedia.org/wiki/GPT-3
no subject
Date: 2021-03-01 08:27 am (UTC)
Invited talk: https://easychair.org/smart-program/CICM-13/2020-07-29.html#talk:155840
Christian Szegedy gives an invited talk "A Promising Path Towards Autoformalization and General Artificial Intelligence"
"ABSTRACT. An autoformalization system is an AI that learns to read natural language content and to turn it into an abstract, machine verifiable formalization, ideally by bootstrapping from unlabeled training data with minimum human interaction. This is a difficult task in general, one that would require strong automated reasoning and automated natural language processing capabilities. In this paper, it is argued that autoformalization is a promising path for systems to learn sophisticated, general purpose reasoning in all domains of mathematics and computer science. This could have far reaching implications not just for mathematical research, but also for software synthesis. Here I provide the outline for a realistic path towards those goals and give a survey of recent results that support the feasibility of this direction."
In reality, my impression was that the talk was first of all about applications of Transformers to all of this (he is a huge fan of Transformers). But it might be that this is simply the part I remembered best.
no subject
Date: 2021-03-01 08:31 am (UTC)
https://github.com/anhinga/2020-notes/tree/master/research-agenda
The "DMMs and Transformers" section has not changed much since then, although the change is probably overdue.
It ends with this:
So far, we have been focusing this exploration along two dimensions:
* Could what we know about DMMs shed some light on the remarkable properties of Transformers?
* What are the ways to incorporate key elements from Transformer architecture into a more flexible DMM setup, and, in particular, could we obtain interesting compact and low training cost models by incorporating attention-inspired and Transformer-inspired motives into DMMs?
no subject
Date: 2021-03-01 08:32 am (UTC)
There were two more conferences in July, so some time was spent on other matters:
Applied Category Theory Conference: July 5-10, https://dmm.dreamwidth.org/30203.html
Judging by my comments to that post, I liked quite a bit of it back then, but I don't remember much now :-( Math is often like this; it is tempting, but then you don't keep studying in that particular direction and you mostly forget (more active work is needed to form a long-term relationship with the material).
Julia Conference (online, free): July 24-31, https://dmm.dreamwidth.org/30245.html
That was important, especially the 4-hour SciML tutorial ( https://twitter.com/ComputingByArts/status/1287501509127811073 ), but a number of other things as well.
no subject
Date: 2021-03-02 07:02 am (UTC)
"The fact that inexpensive transfer learning or inexpensive fine-tuning is possible in many machine learning models such as neural nets and transformers implies that implicit metalearning on the level of model is happening during their training or pretraining."
GPT-3 seems to do this even more, in a style which somewhat resembles MAML (Model-Agnostic Meta-Learning); this was not planned, but the authors try to explain the effect.
no subject
Date: 2021-03-02 07:08 am (UTC)
I also made a slide deck and gave a talk on Sep 3, "Higher-order neuromorphic computations with linear streams": https://github.com/anhinga/2020-notes/tree/master/CCC-2020 (particularly emphasizing the mathematical side of the situation and recalling the material from our 2013-2015 research)
I ended up recording here the two most important papers among those I read, both on efficient Transformers:
Sep 27: "Selected less known papers on Transformers": https://github.com/anhinga/2020-notes/blob/master/attention-based-models/selected-papers.md
no subject
Date: 2021-03-02 07:17 am (UTC)
Here I am reviving my June 2019 preprint on the duality between the matrix of network weights and the matrix of input vectors in DMMs. Back in 2019 it was "some theory work to revisit in the future", but the Transformers' focus on matrix multiplication made me move it to the forefront. Suddenly it started to look like potentially the most promising direction to explore.
Now we are at about 4 months since the GPT-3 revolution.
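As a toy illustration only (this is not the construction from the preprint, just the elementary symmetry it plays with), the two matrices in a matrix-multiplication layer can swap roles under transposition:

```julia
# Weights and inputs play symmetric roles in Y = W * X: transposing the
# equation swaps their positions. (A toy evocation of the duality, nothing more.)
W = rand(4, 3)     # "network weights"
X = rand(3, 5)     # columns of X are input vectors
Y = W * X
Y' ≈ X' * W'       # true: the roles of inputs and weights are interchanged
```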
no subject
Date: 2021-03-02 07:30 am (UTC)
It involved playing with the idea of a machine built around the bilinear operation of matrix multiplication interleaved with non-linear transforms.
In particular, the first experiments interpreting monochrome images as matrices and multiplying them as matrices were done in October (this was done in Julia; I had been playing with Julia until late May, then stopped and switched to thinking about Transformers, and in October I resumed playing with Julia).
In December I reproduced those experiments inside a Jupyter notebook (my first Jupyter notebook in Julia) and committed the notebook version.
So, this was not too fast: Oct-Nov was a slow time of thinking and software explorations, and by the end of November we find ourselves at 6 months since the GPT-3 revolution.
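For readers who want the flavor of those experiments, here is a minimal sketch in Julia; it assumes the Images.jl and TestImages.jl packages and a standard test image, and it is not the notebook's actual code:

```julia
# Interpret a monochrome image as a matrix, multiply it by itself, and look at
# the product as an image again (rescaled back to [0, 1] for display).
using Images, TestImages

img = testimage("cameraman")                         # a standard grayscale test image
M   = Float64.(Gray.(img))                           # the image as a plain Float64 matrix

P = M * M                                            # matrix multiplication of the image with itself
P = (P .- minimum(P)) ./ (maximum(P) - minimum(P))   # rescale to [0, 1]

Gray.(P)                                             # view the product as a monochrome image
```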
no subject
Date: 2021-03-02 07:39 am (UTC)
"Machines with matrix multiplication as a key element", https://github.com/anhinga/2020-notes/blob/master/attention-based-models/matrix-mult-machines.md
Dec 8: committed that Julia Jupyter notebook to https://github.com/anhinga/github-tests
(Feb 2: Made a blog post about this notebook: "Умножая монохромные картинки как матрицы" ("Multiplying monochrome pictures as matrices"): https://dmm.dreamwidth.org/36512.html )
Jan 13-Jan 15: created https://github.com/anhinga/julia-notebooks/tree/main/images-as-matrices and did a scaling study to make sure that the observed effects are robust with respect to scaling; published that study as a Julia Jupyter notebook in the images-as-matrices directory on GitHub.
no subject
Date: 2021-03-02 07:50 am (UTC)
This is also where JAX comes into the picture; from here on it coexists with Julia on an equal footing.
November 30: DeepMind announces that it has "solved" the problem of protein folding (it seems to be solved in the sense that performance on par with the best human labs, but much faster, appears to have been reached; however, it is not clear how to even start addressing the question of whether this system can also solve protein folding problems which humans cannot solve at all: how would we begin to test that?)
In any case, the system, AlphaFold 2, is a hybrid model with attention at its center, so my prediction that hybrids with attention are the future seems to be holding up well: https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
December 4: DeepMind publishes a blog post describing DeepMind JAX ecosystem: https://deepmind.com/blog/article/using-jax-to-accelerate-our-research
This made me realize by late December that JAX and Julia Flux are of comparable flexibility: they are both next-generation, ultra-flexible machine learning frameworks; they allow one to compute gradients of large subsets of Python and Julia (there are some requirements for immutability of arrays in both cases, so one is encouraged to move towards functional programming with immutable data); they allow one to compute gradients with respect to tree-like structures and not just with respect to flat "tensors"; and they do a lot to speed things up and to enable interoperability with all kinds of things. People have flamewars over JAX vs Julia Flux and assert the superiority of whichever of the two systems they like, but honestly, while these two systems are different, they are quite competitive with each other; the trade-offs are rather complicated.
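To illustrate the "gradients with respect to tree-like structures" point on the Julia Flux side (the JAX pytree story is analogous), here is a minimal sketch with made-up parameter names:

```julia
# Gradient with respect to a tree-like (named tuple) parameter structure,
# not just a flat array; Flux re-exports Zygote's `gradient`.
using Flux

x = rand(3)
params = (W = rand(2, 3), b = rand(2))     # a small "tree" of parameters

loss(p) = sum(abs2, p.W * x .+ p.b)        # a toy loss

grads = Flux.gradient(loss, params)[1]     # a named tuple with the same shape
grads.W, grads.b                           # gradients for each leaf of the tree
```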
no subject
Date: 2021-03-02 07:54 am (UTC)
There is nothing "universal" about them; rather, the parts of the system of differential equations which are not well known are replaced by modest feed-forward neural nets (not unlike the connectors in Transformers). The word "universal" comes from the fact that feed-forward neural nets are universal approximators of large classes of functions, so if you don't know a function in a particular part of the right-hand side well, it's reasonable to replace it with a feed-forward net and let the system figure it out, while also finding the other parameters of the "neural differential equation" in question.
This way one gets nice compact models with strong biases from the structure of the system of differential equations.
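A minimal sketch of that idea (illustrative only; a real setup would use the SciML / DifferentialEquations.jl machinery to actually solve and fit the equation, and the names here are made up):

```julia
# A "universal differential equation" style right-hand side: the known part
# (a*u) is kept, and a poorly-known term f(u) is replaced by a small
# feed-forward net whose parameters are fit along with everything else.
using Flux

unknown_term = Chain(Dense(1, 8, tanh), Dense(8, 1))   # universal approximator for f

rhs(u, a) = a * u + unknown_term([u])[1]               # du/dt = a*u + f(u)
```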
Now, if one replaces "differential equation" with "arbitrary differentiable program" (note that occasional discontinuity and non-smoothness are allowed, as in, e.g., ReLU or an if-branch), and a feed-forward net with a model, or a piece of a model, one feels like including in one's differentiable program, then one obtains the "setup of differentiable programming".
So, neural nets, other machine learning models, differential equations, matrix multiplications, various pieces of DMMs, all can be included into "differentiable programs".
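A minimal sketch of what "differentiable program" means in practice (Julia Flux / Zygote; the function itself is made up for illustration): ordinary code with a loop and a branch inside, differentiated end-to-end.

```julia
# An ordinary program with a matrix multiplication and a ReLU-style branch,
# differentiated end-to-end with respect to both arguments.
using Flux

function program(W, x)
    y = W * x
    s = 0.0
    for v in y
        s += v > 0 ? v : 0.1 * v      # non-smooth, but still fine for gradient-based training
    end
    return s
end

W, x = rand(3, 3), rand(3)
gW, gx = Flux.gradient(program, W, x)  # gradients flow through the loop and the branch
```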
If one uses DMMs as a "pure formalism" for differentiable programming, the main advantage one gets from that is better metalearning.
In differentiable programming, full-scale metalearning includes program synthesis, and program synthesis is difficult. Here, using DMMs as a formalism for differentiable programming should yield an advantage.
But one does not have to do this all at once: one can start with a nice differentiable programming system like Julia Flux or JAX/Python, incorporate motives from DMMs into it piecemeal, and eventually consider gradually switching to a more purely DMM-based formalism with better metalearning properties.
no subject
Date: 2021-03-02 08:33 am (UTC)
Jan 4: I publish a research draft about this: "Dataflow matrix machines, tree-shaped flexible tensors, neural architecture search, and PyTorch", https://github.com/anhinga/2021-notes/tree/main/research-drafts
Feb 11-17: I upgrade my "interdisciplinary collaborative research agenda", adding information about JAX, and also adding Section 7, "DMMs vs. differential (differentiable) programming: a meta-learning aspect":
https://github.com/anhinga/2021-notes/tree/main/research-agenda
https://www.cs.brandeis.edu/~bukatin/dmm-collaborative-research-agenda.pdf
no subject
Date: 2021-03-02 08:43 am (UTC)
I publish "9 months since GPT-3 revolution": https://anhinga-anhinga.livejournal.com/84392.html, and there is a useful discussion in the comments there.
I publish this blog post, "К 9 месяцам с появления GPT-3" (its Russian-language counterpart), and these comments.
So here we are, 9 months after the GPT-3 revolution...
(It might be that nothing more needs to be done by people like me: perhaps some smart people elsewhere will make enough breakthroughs in the next few months to "solve AI", and we can only hope that they are thinking well about "AI safety", even though we can't really participate. But if not, continuing this line of research should be of interest, and so I am going to continue working on it. I do spend time looking at what people have been writing recently on "AI safety", e.g. https://dmm.dreamwidth.org/36635.html and some other texts; I think we should at least ponder "AI safety" issues, especially if we are doing work which might turn out to be relevant to the "transition to 'true AI'" and which might affect the dynamics and properties of that transition.)
So one might want simply to experiment with various aspects of DMMs, matrix multiplications, and other ideas that come to mind in the context of DMMs, Transformers, attention-based models, and their interplay, and one should do this within one of the modern ultra-flexible frameworks for differentiable programming, such as Julia Flux or JAX. This line of exploration has a good chance of being fruitful.