"double descent" and "grokking"
In the last few years, people discovered that, in addition to the traditional machine learning trade-off between underfitting and overfitting, there is often also a good zone "to the right of overfitting": the "superoverfitting zone" of heavily overparameterized models is often surprisingly good. This is what the "double descent" terminology stands for: a second descent of test error, not into the sweet spot between underfitting and overfitting, but to the right of the "overfitting boundary". This is the so-called "interpolation regime", where the training loss is zero, yet (counter-intuitively) generalization beyond the training data is also pretty good.
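To make the shape of this concrete, here is a minimal numerical sketch in Python (my own illustration in the spirit of the random-feature experiments in the literature, not code from any of the papers; the target function, noise level and feature counts are arbitrary choices): we fit a noisy 1-D target with random ReLU features, take the minimum-norm least-squares solution, and sweep the number of features past the interpolation threshold (roughly where the number of features equals the number of training points).

import numpy as np

rng = np.random.default_rng(0)

n_train, n_test = 40, 1000
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)

def target(x):
    return np.sin(4 * x)          # the "true" function we are trying to learn

y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
y_test = target(x_test)

def relu_features(x, w, b):
    # random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    return np.maximum(0.0, np.outer(x, w) + b)

for n_features in [5, 10, 20, 40, 80, 200, 1000]:
    w = rng.standard_normal(n_features)
    b = rng.standard_normal(n_features)
    Phi_train = relu_features(x_train, w, b)
    Phi_test = relu_features(x_test, w, b)
    # np.linalg.lstsq returns the minimum-norm solution once the system
    # becomes underdetermined, i.e. exactly the interpolation regime above
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"features={n_features:5d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")

With settings like these one typically sees the test error blow up near n_features == n_train and then come back down as the model gets heavily overparameterized; that second decline is the "second descent".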
This is a good way to explain the nice performance of really huge models, but it also means that if one has only a tiny bit of training data, then moderately-sized (but still large relative to the data) models might do pretty well. And it turns out that if that tiny bit of training data describes the problem precisely, the model might end up finding the exact overall solution to the problem. This is what the authors of a recent paper called "grokking" (following "Stranger in a Strange Land").
What is interesting here is that software engineering tasks often belong to this class. The training data are the constraints that the tests should pass, and those constraints might be relatively compact, so the model in question might be able to derive the desired software (although the "grokking" paper, which I reference in the comments, solves a set of much more narrowly defined mathematical problems, and it remains to be seen how general this approach turns out to be).
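Just to show how compact such a test-based "training set" can be, here is a deliberately tiny, entirely hypothetical example (the function and the test cases are my own, not from any paper): the whole specification of a small run-length encoder is a handful of input/output constraints plus a checker.

# a hypothetical, deliberately tiny test-based specification: the entire
# "training data" for the task is a handful of input/output constraints
# that any acceptable program must satisfy

SPEC = [
    ("", []),
    ("a", [("a", 1)]),
    ("aaabbc", [("a", 3), ("b", 2), ("c", 1)]),
]

def satisfies_spec(candidate):
    """Check whether a candidate implementation meets every constraint."""
    return all(candidate(inp) == out for inp, out in SPEC)

# one human-written candidate, just to show the checker in action
def rle(s):
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

print(satisfies_spec(rle))   # True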
See this talk: https://www.youtube.com/watch?v=OBCciGnOJVs or this paper: https://arxiv.org/abs/1812.11118 by Mikhail Belkin (http://misha.belkin-wang.org/).
The key passage of the paper:
'All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some - in particular, those from richer functions classes - have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. For the learning problems we consider (a range of real-world datasets as well as synthetic data), the inductive bias that seems appropriate is the regularity or smoothness of a function as measured by a certain function space norm. Choosing the smoothest function that perfectly fits observed data is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred (cf. [38, 6]). By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that have smaller norm and are thus "simpler". Thus increasing function class capacity improves performance of classifiers.'
And this follow-up:
https://openai.com/blog/deep-double-descent/
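To make the quoted argument concrete, here is a small self-contained sketch (my own illustration, not code from the paper): in an overparameterized random-features model there are many coefficient vectors that fit the training data exactly, and the minimum-norm one tends to generalize better than an interpolant of larger norm.

import numpy as np

rng = np.random.default_rng(1)

n_train, n_features = 30, 300              # heavily overparameterized
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, 2000)
y_train = np.sin(3 * x_train)
y_test = np.sin(3 * x_test)

w = rng.standard_normal(n_features)
b = rng.standard_normal(n_features)
phi = lambda x: np.maximum(0.0, np.outer(x, w) + b)   # random ReLU features

Phi_train, Phi_test = phi(x_train), phi(x_test)

# minimum-norm interpolant (lstsq returns it for underdetermined systems);
# it lies in the row space of Phi_train
coef_min, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)

# another interpolant: add a direction from the null space of Phi_train,
# which leaves the training fit unchanged but strictly increases the norm
_, _, Vt = np.linalg.svd(Phi_train)
null_direction = Vt[-1]
coef_big = coef_min + 5.0 * null_direction

for name, coef in [("min-norm", coef_min), ("larger-norm", coef_big)]:
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{name:12s} norm={np.linalg.norm(coef):8.2f}  "
          f"train MSE={train_mse:.2e}  test MSE={test_mse:.2e}")

Both candidates have essentially zero training error, but with a typical seed only the minimum-norm ("smoothest") one keeps the test error low, which is exactly the Occam's razor point of the passage above.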
So, one thing "double descent" tells us is why huge models might be very reasonable.
But if one does not have too much data, then quite moderately-sized models can also be very good and can generalize to exact solutions:
https://twitter.com/ComputingByArts/status/1391287083700981760
In detail:
https://twitter.com/yieldthought/status/1392061340940914688
My further comments on it:
"1) The effect in the left image of Figure 1 is quite striking. Figure 4 is also quite remarkable.
2) It might be the case that for precisely defined synthetic tasks the effects tend to be "more pronounced" (and tend to lead to the ability to solve the task exactly). It's premature to make this kind of general pronouncement, especially about the ability to solve the task exactly, but this paper seems to push us to at least consider this kind of conjecture.
3) If 2 is actually true, then one notices that these are exactly the conditions of a typical program synthesis problem (a modestly-sized problem precisely defined by a few constraints, namely the expected test results). So it might be that modestly-sized models (like the small transformer used in this paper) will actually be able to solve such tasks, because the data size is really small, and so the desirable "superoverfitting area" is not too far..."
https://mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf#page=1
"Grokking: generalization beyond overfitting on small algorithmic datasets".
https://mathai-iclr.github.io/
MATH-AI: The Role of Mathematical Reasoning in General Artificial Intelligence
All papers (including this one): https://mathai-iclr.github.io/papers/
I'll write more about some of the papers eventually.