"double descent" and "grokking"
May. 30th, 2021 05:58 pm

In the last few years, people discovered that in addition to the traditional machine learning trade-off between underfitting and overfitting, there is often also a good zone "to the right of overfitting": the "super-overfitting zone" of heavily overparameterized models is often surprisingly good. This is what the term "double descent" stands for: a second descent of the test error, not into the sweet spot between underfitting and overfitting, but to the right of the "overfitting boundary". This is the so-called "interpolation regime", where training loss is zero, yet generalization beyond the training data is (counter-intuitively) still pretty good.
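(To make this concrete, here is a minimal sketch, not from the post itself: minimum-norm least squares on random Fourier features for a toy 1-D regression. The dataset, feature counts, and noise level are illustrative assumptions; the test error typically spikes near the interpolation threshold, where the number of features is about equal to the number of training points, and then descends again as the model gets wider.)

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

n_train = 20
x_train = rng.uniform(0.0, 1.0, n_train)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
x_test = np.linspace(0.0, 1.0, 500)
y_test = target(x_test)

def features(x, freqs, phases):
    # Random Fourier features: cos(freq * x + phase), one column per random frequency.
    return np.cos(np.outer(x, freqs) + phases)

for n_feat in [2, 5, 10, 20, 40, 100, 1000]:
    freqs = rng.normal(0.0, 20.0, n_feat)
    phases = rng.uniform(0.0, 2 * np.pi, n_feat)
    Phi_train = features(x_train, freqs, phases)
    Phi_test = features(x_test, freqs, phases)
    # np.linalg.lstsq returns the minimum-norm solution once the system is
    # underdetermined (n_feat > n_train), i.e. to the right of the
    # interpolation threshold -- the "second descent" regime.
    w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ w - y_train) ** 2)
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"{n_feat:5d} features | train MSE {train_mse:.4f} | test MSE {test_mse:.4f}")
```

With most seeds the printed test MSE is worst around 20 features (the interpolation threshold for 20 training points) and improves again for the widest models, which is the second descent.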
This is a good way to explain the nice performance of really huge models, but it also means that if one has only a tiny amount of training data, then moderately sized (but still large) models might do pretty well. And it turns out that if that tiny amount of training data describes the problem precisely, the model may end up finding the precise general solution to the problem. This is what the authors of a recent paper called "grokking" (following "Stranger in a Strange Land").
What is interesting here is that software engineering tasks often belong to this class. The training data are the constraints that tests should pass, and these can be relatively compact, so the model in question might be able to derive the desired software from them (although the "grokking" paper, which I reference in the comments, solves a set of much more narrowly defined mathematical problems, and it remains to be seen how general this approach turns out to be).
no subject
Date: 2021-05-30 10:10 pm (UTC)

This talk https://www.youtube.com/watch?v=OBCciGnOJVs or this paper https://arxiv.org/abs/1812.11118 by http://misha.belkin-wang.org/
The key passage of the paper:
'All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some - in particular, those from richer function classes - have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. For the learning problems we consider (a range of real-world datasets as well as synthetic data), the inductive bias that seems appropriate is the regularity or smoothness of a function as measured by a certain function space norm. Choosing the smoothest function that perfectly fits observed data is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred (cf. [38, 6]). By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that have smaller norm and are thus "simpler". Thus increasing function class capacity improves performance of classifiers.'
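(A small numeric illustration of the quoted argument, my own sketch rather than anything from the paper: with a fixed random-ReLU feature map, the minimum-norm output weights that exactly fit the same training set tend to have smaller norm as the width of the feature map grows; the Euclidean norm of the output weights is used here as a crude stand-in for the function space norm mentioned in the quote.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15                                   # training points
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(n)

for width in [20, 50, 200, 1000, 5000]:
    # Fixed random ReLU features; only the output weights are fitted.
    W = rng.normal(size=width)
    b = rng.uniform(-1.0, 1.0, width)
    Phi = np.maximum(x[:, None] * W[None, :] + b[None, :], 0.0)
    # The pseudoinverse gives the minimum-norm output weights that interpolate y.
    w = np.linalg.pinv(Phi) @ y
    max_residual = np.max(np.abs(Phi @ w - y))
    print(f"width {width:5d} | max train residual {max_residual:.1e} | ||w|| = {np.linalg.norm(w):.3f}")
```

The train residual should stay near numerical zero for every width (all of these are interpolating solutions), while ||w|| tends to shrink as the function class gets richer - the "smaller norm and thus simpler" effect the passage describes.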
And this follow-up:
https://openai.com/blog/deep-double-descent/