"double descent" and "grokking"
May. 30th, 2021 05:58 pm
In the last few years, people discovered that, in addition to the traditional machine learning trade-off between underfitting and overfitting, there is often also a good zone "to the right of overfitting": the "superoverfitting zone" of heavily overparameterized models is often surprisingly good. This is what the "double descent" terminology stands for: a second descent of test error, not into the sweet spot between underfitting and overfitting, but to the right of the "overfitting boundary". This is the so-called "interpolation mode", where training loss is zero, but, counter-intuitively, generalization beyond the training data is also pretty good.
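The shape of the double descent curve can be seen even in a toy linear setting. Here is a minimal sketch (not from the paper; the task, feature counts, and noise level are my own illustrative choices) using minimum-norm least squares on random Fourier features: past the interpolation threshold, training error hits zero while test error typically comes back down.

```python
# Hypothetical toy illustration of double descent: fit random Fourier
# features to a small 1-D regression task, sweeping the number of features
# past the interpolation threshold (n_features > n_train).
import numpy as np

rng = np.random.default_rng(0)
n_train = 20
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=n_train)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

max_k = 100
freqs = rng.normal(scale=5, size=max_k)
phases = rng.uniform(0, 2 * np.pi, max_k)

def features(x, k):
    # Random Fourier features: cos(freq * x + phase), first k of them
    return np.cos(np.outer(x, freqs[:k]) + phases[:k])

train_err, test_err = [], []
for k in range(1, max_k + 1):
    Phi = features(x_train, k)
    # np.linalg.pinv yields the minimum-norm least-squares solution;
    # once k >= n_train this is the "interpolation mode": training
    # loss is (numerically) zero.
    w = np.linalg.pinv(Phi) @ y_train
    train_err.append(np.mean((Phi @ w - y_train) ** 2))
    test_err.append(np.mean((features(x_test, k) @ w - y_test) ** 2))

print(f"train MSE at k={max_k}: {train_err[-1]:.2e}")
print(f"test MSE near threshold (k={n_train}): {test_err[n_train - 1]:.3f}")
print(f"test MSE at k={max_k}: {test_err[-1]:.3f}")
```

Plotting test_err against k would typically show the classic picture: a peak around k ≈ n_train, then a second descent as the model becomes more and more overparameterized.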
This is a good way to explain the nice performance of really huge models; but also, if one has only a tiny bit of training data, then moderately-sized (but still large) models might do pretty well. And it turns out that if that tiny bit of training data describes the problem precisely, the model might find the precise overall solution to the problem. This is what the authors of a recent paper called "grokking" (following "Stranger in a Strange Land").
What is interesting here is that software engineering tasks often belong to this class. The training data are the constraints that tests should pass, and they might be relatively compact, and then the model in question might be able to derive the desired software (although the "grokking paper" (which I reference in the comments) solves a set of much more narrowly defined mathematical problems, and it remains to be seen how general this approach turns out to be).
no subject
Date: 2021-05-30 10:10 pm (UTC)
so, one thing "double descent" tells us is why huge models might be very reasonable.
but if one does not have too much data, then very moderately-sized models can be very good and can generalize to exact solutions:
https://twitter.com/ComputingByArts/status/1391287083700981760
in detail:
https://twitter.com/yieldthought/status/1392061340940914688
my further comments to it:
"1) The effect in the left image of Figure 1 is quite striking. Figure 4 is also quite remarkable.
2) It might be the case that for precisely defined synthetic tasks the effects tend to be "more pronounced" (and tend to lead to the ability to solve the task exactly). It's premature to make this kind of general pronouncement, especially about the ability to solve the task exactly, but this paper seems to push us to at least consider this kind of conjecture.
3) If 2 is actually true, then one notices that these are the conditions for a typical program synthesis problem (a modestly-sized problem precisely defined by a few constraints (expected test results)). So it might be that modestly-sized models (like the small transformer used in this paper) will actually be able to solve these tasks, because the data size is really small, so the desirable "superoverfitting area" is not too far..."
no subject
Date: 2021-05-30 10:13 pm (UTC)
https://mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf#page=1
"Grokking: generalization beyond overfitting on small algorithmic datasets".
no subject
Date: 2021-05-30 10:15 pm (UTC)
https://mathai-iclr.github.io/
MATH-AI: The Role of Mathematical Reasoning in General Artificial Intelligence
All papers (including this one): https://mathai-iclr.github.io/papers/
I'll write more about some of the papers eventually.