OpenAI code generation breakthrough
May 22nd, 2020, 08:20 pm

In this video the Microsoft CTO interviews the OpenAI CEO, starting from the 25:00 mark (right before that mark he talks about a huge computer system Microsoft created for OpenAI; the style of the overall Microsoft video feels quite odd to my taste, but this fragment with Sam Altman is good):
twitter.com/matvelloso/status/1263193089310461952
At about 29:00 mark OpenAI demos their new transformer-based code-generating system trained on a large subset of GitHub. I'd say, it's quite impressive, it does feel like a breakthrough in coding-assisting tools. Some discussion here:
news.ycombinator.com/item?id=23250379
Generally speaking, people are saying lately that large modern transformer models only pretend to be sequence-to-sequence, but in reality they learn tons of structured linguistic information, see e.g. this informal essay-style paper and references therein:
arxiv.org/abs/2005.06420 "The Unstoppable Rise of Computational Linguistics in Deep Learning"
(This is not yet an artificial junior software engineer one can hire, but this OpenAI prototype is a considerable step in that direction. May 20, 2020 will be remembered as an important milestone.)
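To make the demo concrete: at inference time such a system simply treats the English description as a prefix and repeatedly predicts the next token. Here is a minimal sketch of that decoding loop, where the "model" is a hypothetical hand-written lookup table standing in for a trained transformer; only the loop itself mirrors the real setup.

```python
# A toy sketch of how a code-generating language model is used at inference
# time: the English prompt is just a prefix, and the model repeatedly emits
# the most likely next token until it produces a stop token.
# TOY_MODEL is invented for illustration; a real system would be a trained
# transformer scoring a large vocabulary at each step.

TOY_MODEL = {
    # context (last token) -> most likely next token
    "<prompt>": "def",
    "def": "is_palindrome(s):",
    "is_palindrome(s):": "return",
    "return": "s == s[::-1]",
    "s == s[::-1]": "<eos>",
}

def greedy_decode(model, start="<prompt>", max_tokens=10):
    """Greedy decoding: always take the single most likely next token."""
    out, tok = [], start
    for _ in range(max_tokens):
        tok = model.get(tok, "<eos>")
        if tok == "<eos>":
            break
        out.append(tok)
    return " ".join(out)

print(greedy_decode(TOY_MODEL))
```

Real systems sample from the predicted distribution rather than always taking the top token, but the shape of the loop is the same.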
no subject
Date: 2020-05-23 01:15 am (UTC)

Wow. That's about the first part (MSFT).

Now I feel like it's bullshit. The guys have a huge code database, with comments (maybe some written by the "data engineering team"). And then they "translate" from English to Python, using that corpus. Then they find several examples that worked, and show them to the public.

It has nothing to do with programming.
But another wow, about computational linguistics. When I talked with Dima Gensel, he was adamant against using any linguistics at all, just statistics. Well, OK, that was his PhD, so. It kind of worked. Except that it worked only after it was repaired, I guess.
Cool, cool.
no subject
Date: 2020-05-23 02:53 am (UTC)

He had nothing to gain from showing something that was a total dead end, and he had a reputation to lose. So, based on his personal track record, I think it's real (which does not mean that it is ready for people to use in production; but it's unlikely to be just a PR fake either; given OpenAI's track record, it'll probably become something production-ready quite soon).
Yes, my judgement here is based on who Sam Altman is, not just on the demo itself.
no subject
Date: 2020-05-23 03:09 am (UTC)

no subject
Date: 2020-05-23 04:08 am (UTC)

no subject
Date: 2020-05-23 04:37 am (UTC)

no subject
Date: 2020-05-23 03:11 pm (UTC)

https://arxiv.org/abs/1905.09418 "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (this paper is led by people from Yandex)
https://arxiv.org/abs/1905.05950 "BERT Rediscovers the Classical NLP Pipeline" (I've seen this one earlier)
https://arxiv.org/abs/1906.02715 "Visualizing and Measuring the Geometry of BERT" (this looks very interesting)
There are also dozens of references to these papers at this point (58, 75, and 19 references respectively, counted by Google Scholar); so there must be further analysis in the literature.
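The object these papers probe is the per-head attention pattern. As a reminder of what a single head actually computes, here is a minimal pure-Python sketch of scaled dot-product attention (no ML framework; all vectors below are invented toy numbers). Pruning a head, as in the first paper, amounts to discarding this output for that head.

```python
# Scaled dot-product attention for one head, in plain Python.
# For each query vector, mix the value vectors with weights
# softmax(q . k / sqrt(d)) over all key vectors.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# The query is aligned with the first key, so the output leans
# toward the first value vector.
result = attention([[1.0, 0.0]],
                   [[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 2.0], [3.0, 4.0]])
```

A multi-head layer runs several such maps in parallel on learned projections and concatenates the results, which is what makes it possible to ask which individual heads specialize in syntactic roles.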
no subject
Date: 2020-05-24 06:51 am (UTC)

And readable, understandable code is, by itself, a super-important goal - so there is harmony here; this system does generate nice code (achievable by filtering which repositories are admissible as training data).
Meanwhile Microsoft released this thing, which is "officially unrelated" to the OpenAI effort (but there will be an attempt to integrate a version of OpenAI system into IntelliCode in the future): https://arxiv.org/abs/2005.08025 "IntelliCode Compose: Code Generation Using Transformer"
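For contrast with the transformer approach in IntelliCode Compose, here is a sketch of the kind of naive baseline it improves on: whole-line completion by frequency over previously seen lines sharing a prefix. The corpus and function here are invented for illustration; a transformer can instead generate lines it has never seen verbatim.

```python
# Naive whole-line completion: return the most frequent line in a
# corpus of previously seen code lines that starts with the given prefix.
from collections import Counter

CORPUS = [
    "for i in range(len(items)):",
    "for i in range(len(items)):",
    "for item in items:",
    "import os",
]

def complete_line(prefix, corpus):
    """Most frequent full line starting with prefix, or None."""
    candidates = Counter(line for line in corpus if line.startswith(prefix))
    return candidates.most_common(1)[0][0] if candidates else None

print(complete_line("for i", CORPUS))
```

The obvious limitation is that this can only replay lines from the corpus, which is exactly the gap a generative model closes.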
no subject
Date: 2020-06-19 02:15 am (UTC)

https://hub.packtpub.com/microsofts-visual-studio-intellicode-gets-improved-features-whole-line-code-completions-ai-assisted-refactoring-and-more/
***
https://microsoft.github.io/prose/
https://www.microsoft.com/en-us/research/publication/on-the-fly-synthesis-of-edit-suggestions/
https://devblogs.microsoft.com/visualstudio/ai-assisted-developer-tools/
( https://twitter.com/amandaksilver/status/1191573487191838722 )
no subject
Date: 2020-06-25 05:05 am (UTC)

Today I revisited this point (I have not been sufficiently relaxed in this sense recently, despite observing this May 22 milestone), and I decided that I should really drop my "self-imposed obligation to push DMM advances as hard as possible".
I should go back to the "free research mode" which is more natural for me.
Dataflow matrix machines should just be one of the things I am doing (it is already so, effectively anyway), and I should do it only to the extent I feel like it, and only in the directions which feel attractive to me at a given moment.
***
Rich Hickey, in his essay "Open Source is Not About You":
https://gist.github.com/richhickey/1563cddea1002958f96e7ba9519972d9
"Just because someone open sources something does not imply they owe the world a change in their status, focus and effort, e.g. from inventor to community manager."
"Open source is a no-strings-attached gift, and all participants should recognize it as such."
So, it would be wrong to think that just because I created DMMs and the body of DMM-related research, papers, and code, I therefore owe it morally to anyone (including myself) to further push hard in this direction.
I was allowing this situation with DMMs and their potential to attach strings and constraints on me, and I should be free to liberate myself from those attachments.
***
I should also be more neutral in the sense of Paul Graham's essay "Keep Your Identity Small":
http://www.paulgraham.com/identity.html
"The most intriguing thing about this theory, if it's right, is that it explains not merely which kinds of discussions to avoid, but how to have better ideas. If people can't think clearly about anything that has become part of their identity, then all other things being equal, the best plan is to let as few things into your identity as possible."
(E.g. regarding the AI timeline, it makes sense not to have strong opinions on that as a part of my core identity. Basically, it makes sense not to be too attached to any outcome here.)
no subject
Date: 2020-07-13 05:51 pm (UTC)

no subject
Date: 2020-10-29 12:51 am (UTC)

no subject
Date: 2021-03-01 07:31 am (UTC)

The "Language Models are Few-Shot Learners" paper itself: 591 references (I expected more at this point).