Towards the 9-month mark since the appearance of GPT-3
Mar. 1st, 2021 02:03 am
My professional activity over the last nine months has been entirely colored by the breakthrough that came with the creation of GPT-3 and by the discovery that this thing already has quite magical properties.
Here, in the comments, I want to trace how it all went and what I tried to do about it along the way (including on GitHub).
The revolution caused, or at least radically accelerated, by the appearance of GPT-3 and the work that followed is in full swing, and I am not sure anyone manages to keep track of all the important developments in this area. I am not attempting a survey; this is, rather, an attempt to recall my personal trajectory.
no subject
Date: 2021-03-01 07:25 am (UTC)
The demo turned out to be quite real; since then, many people have built various similar things on top of GPT-3.
no subject
Date: 2021-03-01 07:27 am (UTC)
no subject
Date: 2021-03-01 07:36 am (UTC)
It's really difficult to get access, but a number of people do, and it soon becomes quite clear that the system is everything it is advertised to be, and more. (One needs to "play with it" to make it do what one wants, but it's doable.)
no subject
Date: 2021-03-01 07:41 am (UTC)
It gradually dawned on me that the future belongs to hybrids of the most diverse methods with Transformers and other attention-based models.
***
'Apparently, a period of hybridization is inevitable next: all kinds of infusions of hierarchical-attention motifs into all sorts of situations and models that people work with. (Generally speaking, they do not even always have to be so huge; I think there is room for small, elegant modifications of the idea of hierarchical attention as well.)'
'One of the strange thoughts that occurred to me recently in this connection is that there is a lot in common, in spirit, between the hierarchical attention structures in these models and the regulation of gene expression in cells.'
'By the way, a group from Salesforce Research showed that "the Transformer's attention mechanism recovers high-level structural (folding) and functional properties of proteins", in addition to all the other magic achievable with this class of models.'
no subject
Date: 2021-03-01 07:48 am (UTC)
no subject
Date: 2021-03-01 07:52 am (UTC)
https://www.cs.brandeis.edu/~bukatin/transformer_revolution.html
no subject
Date: 2021-03-01 07:56 am (UTC)
This is still an active project with commits happening from time to time.
no subject
Date: 2021-03-01 08:00 am (UTC)
I am explaining the intuition behind "content-based attention": why it is natural to think of a linear combination as "attention". (I did not understand this before; this explanation is the result of my new focus on these issues.)
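For what it's worth, here is a minimal self-contained sketch of that intuition in Julia (with made-up numbers, not the actual text of my notes): a query scores each memory vector by a dot product, the scores go through a softmax, and the output is just the resulting linear combination of the memory vectors.

```julia
# Content-based attention as a linear combination (toy illustration):
# attention weights come from similarities, and the output is the weighted
# sum (i.e., a linear combination) of the memory vectors.
softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

q = [1.0, 0.0, 2.0]                # a query vector
M = [0.5  0.1  2.0;                # each row of M is one memory vector
     1.0  0.0  0.0;
     0.0  1.0  1.0]

scores  = M * q                    # dot-product similarity of q with each row
weights = softmax(scores)          # attention weights, summing to 1
output  = M' * weights             # linear combination of the memory vectors
```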
no subject
Date: 2021-03-01 08:03 am (UTC)
This explains how attention is used in Transformers.
Further updated on August 5.
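As a reminder of the mechanism that note is about, here is a minimal Julia sketch (single head, no masking, illustrative shapes only) of the scaled dot-product attention used in Transformers:

```julia
# Scaled dot-product attention (one head, no mask). Rows of Q, K, V are token
# vectors; each output row is a linear combination of the rows of V.
softmax_rows(A) = mapslices(v -> exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v))), A; dims=2)

attention(Q, K, V) = softmax_rows(Q * K' ./ sqrt(size(K, 2))) * V

Q, K, V = rand(4, 8), rand(4, 8), rand(4, 8)   # 4 tokens, dimension 8
attention(Q, K, V)                             # 4×8 matrix of attended values
```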
no subject
Date: 2021-03-01 08:05 am (UTC)
Further updated on July 25.
no subject
Date: 2021-03-01 08:15 am (UTC)
Started making notes on possible attention-related experiments:
1)"Adding feedback to Transformers"
2) "DMM neural achitecture search" (здесь, кажется, впервые в явном виде возникают наблюдения о тесных связях между DMMs и Transformers: и те и другие основаны на линейных комбинациях, и объединение многих линейных комбинаций в матричное умножение тоже является важным мотивом и там и там.
На самом деле, эти тесные связи и являются ключевым мотивом для моего продолжающегося внимания к теме attention and Transformers - всё это гораздо ближе к тому, чем я и так занимаюсь, чем очевидно на первый взгляд. И есть надежда на разнообразное плодотворное взаимодействие между этими темами.)
July 28: added "Semantic grounding experiments"
July 31: added "Higher-order attention" (another section which is closely related to DMMs and which inspired quite a bit of what came later)
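To make the "shared motif" remark in item 2 concrete, here is a toy Julia sketch (illustrative numbers only) of how many linear combinations of the same vectors collapse into a single matrix multiplication:

```julia
# Many linear combinations of the same set of vectors, packed into one matmul.
V = rand(5, 3)     # 5 vectors of dimension 3 (rows of V)
W = rand(10, 5)    # 10 sets of mixing coefficients (rows of W)
C = W * V          # row i of C is the linear combination with coefficients W[i, :]
```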
no subject
Date: 2021-03-01 08:21 am (UTC)
I was not aware of this page until recently: https://en.wikipedia.org/wiki/GPT-3
no subject
Date: 2021-03-01 08:27 am (UTC)
Invited talk: https://easychair.org/smart-program/CICM-13/2020-07-29.html#talk:155840
Christian Szegedy gives an invited talk "A Promising Path Towards Autoformalization and General Artificial Intelligence"
"ABSTRACT. An autoformalization system is an AI that learns to read natural language content and to turn it into an abstract, machine verifiable formalization, ideally by bootstrapping from unlabeled training data with minimum human interaction. This is a difficult task in general, one that would require strong automated reasoning and automated natural language processing capabilities. In this paper, it is argued that autoformalization is a promising path for systems to learn sophisticated, general purpose reasoning in all domains of mathematics and computer science. This could have far reaching implications not just for mathematical research, but also for software synthesis. Here I provide the outline for a realistic path towards those goals and give a survey of recent results that support the feasibility of this direction."
In reality, my impression was that the talk was first of all about applications of Transformers to all of this (he is a huge fan of Transformers). But it might be that this is simply the part I remembered best.
no subject
Date: 2021-03-01 08:31 am (UTC)
https://github.com/anhinga/2020-notes/tree/master/research-agenda
The "DMMs and Transformers" section has not changed much since then, although the change is probably overdue.
It ends with this:
So far, we have been focusing this exploration along two dimensions:
* Could what we know about DMMs shed some light on the remarkable properties of Transformers?
* What are the ways to incorporate key elements from Transformer architecture into a more flexible DMM setup, and, in particular, could we obtain interesting compact and low training cost models by incorporating attention-inspired and Transformer-inspired motives into DMMs?
no subject
Date: 2021-03-01 08:32 am (UTC)
There were two more conferences in July, so some time was spent on other matters:
Applied Category Theory Conference: July 5-10, https://dmm.dreamwidth.org/30203.html
Judging by my comments to that post, I liked quite a bit of it back then, but I don't remember much now :-( Math is often like this; it is tempting, but then you don't keep studying in that particular direction and you mostly forget (more active work is needed to form a long-term relationship with the material).
Julia Conference (online, free): July 24-31, https://dmm.dreamwidth.org/30245.html
That was important, especially the 4-hour SciML tutorial ( https://twitter.com/ComputingByArts/status/1287501509127811073 ), but a number of other things as well.
no subject
Date: 2021-03-02 07:02 am (UTC)
"The fact that inexpensive transfer learning or inexpensive fine-tuning is possible in many machine learning models such as neural nets and transformers implies that implicit metalearning on the level of model is happening during their training or pretraining."
GPT-3 seems to do this even more, in a style which somewhat resembles MAML (Model-Agnostic Meta-Learning); this was not planned, but the authors try to explain the effect.
no subject
Date: 2021-03-02 07:08 am (UTC)
I also made a slide deck and gave a talk on Sep 3, "Higher-order neuromorphic computations with linear streams": https://github.com/anhinga/2020-notes/tree/master/CCC-2020 (particularly emphasizing the mathematical side of the situation and recalling the material from our 2013-2015 research)
I ended up recording here the two most important papers among those I read, both on efficient Transformers:
Sep 27: "Selected less known papers on Transformers": https://github.com/anhinga/2020-notes/blob/master/attention-based-models/selected-papers.md
no subject
Date: 2021-03-02 07:17 am (UTC)
Here I am reviving my June 2019 preprint on the duality between the matrix of network weights and the matrix of input vectors in DMMs. Back in 2019 it was "some theory work to revisit in the future", but the Transformers' focus on matrix multiplication made me move it to the forefront. Suddenly it started to look like potentially the most promising direction to explore.
Now we are at about 4 months since the GPT-3 revolution.
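As a toy illustration only (this is not the construction from the preprint, just the elementary symmetry it plays with), the two matrices in a matrix-multiplication layer can swap roles under transposition:

```julia
# Weights and inputs play symmetric roles in Y = W * X: transposing the
# equation swaps their positions. (A toy evocation of the duality, nothing more.)
W = rand(4, 3)     # "network weights"
X = rand(3, 5)     # columns of X are input vectors
Y = W * X
Y' ≈ X' * W'       # true: the roles of inputs and weights are interchanged
```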
no subject
Date: 2021-03-02 07:30 am (UTC)
It involved playing with the idea of a machine built around the bilinear operation of matrix multiplication interleaved with non-linear transforms.
In particular, the first experiments interpreting monochrome images as matrices and multiplying them as matrices were done in October (this was done in Julia; I had been playing with Julia until late May, then stopped and switched to thinking about Transformers, and in October I resumed playing with Julia).
In December I reproduced those experiments inside a Jupyter notebook (my first Jupyter notebook in Julia) and committed the notebook version.
So, this was not too fast: Oct-Nov was a slow time of thinking and software explorations, and by the end of November we find ourselves at 6 months since the GPT-3 revolution.
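For readers who want the flavor of those experiments, here is a minimal sketch in Julia; it assumes the Images.jl and TestImages.jl packages and a standard test image, and it is not the notebook's actual code:

```julia
# Interpret a monochrome image as a matrix, multiply it by itself, and look at
# the product as an image again (rescaled back to [0, 1] for display).
using Images, TestImages

img = testimage("cameraman")                         # a standard grayscale test image
M   = Float64.(Gray.(img))                           # the image as a plain Float64 matrix

P = M * M                                            # matrix multiplication of the image with itself
P = (P .- minimum(P)) ./ (maximum(P) - minimum(P))   # rescale to [0, 1]

Gray.(P)                                             # view the product as a monochrome image
```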
no subject
Date: 2021-03-02 07:39 am (UTC)
"Machines with matrix multiplication as a key element", https://github.com/anhinga/2020-notes/blob/master/attention-based-models/matrix-mult-machines.md
Dec 8: committed that Julia Jupyter notebook to https://github.com/anhinga/github-tests
(Feb 2: Made a blog post about this notebook: "Умножая монохромные картинки как матрицы" ("Multiplying monochrome pictures as matrices"): https://dmm.dreamwidth.org/36512.html )
Jan 13-Jan 15: created https://github.com/anhinga/julia-notebooks/tree/main/images-as-matrices and did a scaling study to make sure that the observed effects are robust with respect to scaling; published that study as a Julia Jupyter notebook in the images-as-matrices directory on GitHub.
no subject
Date: 2021-03-02 07:50 am (UTC)
This is also where JAX comes into the picture; from here on it coexists with Julia on an equal footing.
November 30: DeepMind announces that it has "solved" the problem of protein folding (it seems to be solved in the sense that performance on par with the best human labs, but much faster, appears to have been reached; however, it is not clear how to even start addressing the question of whether this system can also solve protein folding problems which humans cannot solve at all: how would we begin to test that?)
In any case, the system, AlphaFold 2, is a hybrid model with attention at its center, so my prediction that hybrids with attention are the future seems to be holding up well: https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
December 4: DeepMind publishes a blog post describing DeepMind JAX ecosystem: https://deepmind.com/blog/article/using-jax-to-accelerate-our-research
This made me realize by late December that JAX and Julia Flux are of comparable flexibility: they are both next-generation, ultra-flexible machine learning frameworks; they allow one to compute gradients of large subsets of Python and Julia (there are some requirements for immutability of arrays in both cases, so one is encouraged to move towards functional programming with immutable data); they allow one to compute gradients with respect to tree-like structures and not just with respect to flat "tensors"; and they do a lot to speed things up and to enable interoperability with all kinds of things. People have flamewars over JAX vs Julia Flux and assert the superiority of whichever of the two systems they like, but honestly, while these two systems are different, they are quite competitive with each other; the trade-offs are rather complicated.
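To illustrate the "gradients with respect to tree-like structures" point on the Julia Flux side (the JAX pytree story is analogous), here is a minimal sketch with made-up parameter names:

```julia
# Gradient with respect to a tree-like (named tuple) parameter structure,
# not just a flat array; Flux re-exports Zygote's `gradient`.
using Flux

x = rand(3)
params = (W = rand(2, 3), b = rand(2))     # a small "tree" of parameters

loss(p) = sum(abs2, p.W * x .+ p.b)        # a toy loss

grads = Flux.gradient(loss, params)[1]     # a named tuple with the same shape
grads.W, grads.b                           # gradients for each leaf of the tree
```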
no subject
Date: 2021-03-02 07:54 am (UTC)
There is nothing "universal" about them; rather, the parts of the system of differential equations which are not well known are replaced by modest feed-forward neural nets (not unlike the connectors in Transformers). The word "universal" comes from the fact that feed-forward neural nets are universal approximators of large classes of functions, so if you don't know a function in a particular part of the right-hand side well, it's reasonable to replace it with a feed-forward net and let the system figure it out, while also finding the other parameters of the "neural differential equation" in question.
This way one gets nice compact models with strong biases from the structure of the system of differential equations.
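A minimal sketch of that idea (illustrative only; a real setup would use the SciML / DifferentialEquations.jl machinery to actually solve and fit the equation, and the names here are made up):

```julia
# A "universal differential equation" style right-hand side: the known part
# (a*u) is kept, and a poorly-known term f(u) is replaced by a small
# feed-forward net whose parameters are fit along with everything else.
using Flux

unknown_term = Chain(Dense(1, 8, tanh), Dense(8, 1))   # universal approximator for f

rhs(u, a) = a * u + unknown_term([u])[1]               # du/dt = a*u + f(u)
```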
Now, if one replaces "differential equation" with "arbitrary differentiable program" (note that occasional discontinuity and non-smoothness are allowed, as in, e.g., ReLU or an if-branch), and a feed-forward net with a model, or a piece of a model, one feels like including in one's differentiable program, then one obtains the "setup of differentiable programming".
So, neural nets, other machine learning models, differential equations, matrix multiplications, various pieces of DMMs, all can be included into "differentiable programs".
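A minimal sketch of what "differentiable program" means in practice (Julia Flux / Zygote; the function itself is made up for illustration): ordinary code with a loop and a branch inside, differentiated end-to-end.

```julia
# An ordinary program with a matrix multiplication and a ReLU-style branch,
# differentiated end-to-end with respect to both arguments.
using Flux

function program(W, x)
    y = W * x
    s = 0.0
    for v in y
        s += v > 0 ? v : 0.1 * v      # non-smooth, but still fine for gradient-based training
    end
    return s
end

W, x = rand(3, 3), rand(3)
gW, gx = Flux.gradient(program, W, x)  # gradients flow through the loop and the branch
```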
If one uses DMMs as a "pure formalism" for differentiable programming, the main advantage one gets from that is better metalearning.
In differentiable programming, full-scale metalearning includes program synthesis, and program synthesis is difficult. Here, using DMMs as a formalism for differentiable programming should yield an advantage.
But one does not have to do this all at once: one can start with a nice differentiable programming system like Julia Flux or JAX/Python, incorporate motives from DMMs into it piecemeal, and eventually consider gradually switching to a more purely DMM-based formalism with better metalearning properties.
no subject
Date: 2021-03-02 08:33 am (UTC)
Jan 4: I publish a research draft about this: "Dataflow matrix machines, tree-shaped flexible tensors, neural architecture search, and PyTorch", https://github.com/anhinga/2021-notes/tree/main/research-drafts
Feb 11-17: I upgrade my "interdisciplinary collaborative research agenda", adding information about JAX, and also adding Section 7, "DMMs vs. differential (differentiable) programming: a meta-learning aspect":
https://github.com/anhinga/2021-notes/tree/main/research-agenda
https://www.cs.brandeis.edu/~bukatin/dmm-collaborative-research-agenda.pdf
no subject
Date: 2021-03-02 08:43 am (UTC)
I publish "9 months since GPT-3 revolution": https://anhinga-anhinga.livejournal.com/84392.html, and there is a useful discussion in the comments there.
I publish this blog post, "К 9 месяцам с появления GPT-3" (its Russian-language counterpart), and these comments.
So here we are, 9 months after the GPT-3 revolution...
(It might be that nothing more needs to be done by people like me: perhaps some smart people elsewhere will make enough breakthroughs in the next few months to "solve AI", and we can only hope that they are thinking well about "AI safety", even though we can't really participate. But if not, continuing this line of research should be of interest, and so I am going to continue working on it. I do spend time looking at what people have been writing recently on "AI safety", e.g. https://dmm.dreamwidth.org/36635.html and some other texts; I think we should at least ponder "AI safety" issues, especially if we are doing work which might turn out to be relevant to the "transition to 'true AI'" and which might affect the dynamics and properties of that transition.)
So one might want simply to experiment with various aspects of DMMs, matrix multiplications, and other ideas that come to mind in the context of DMMs, Transformers, attention-based models, and their interplay, and one should do this within one of the modern ultra-flexible frameworks for differentiable programming, such as Julia Flux or JAX. This line of exploration has a good chance of being fruitful.