"double descent" and "grokking"
May. 30th, 2021 05:58 pm

In the last few years, people discovered that in addition to the traditional machine learning trade-off between underfitting and overfitting, there is often also a good zone "to the right of overfitting": the "superoverfitting zone" of heavily overparameterized models is often surprisingly good. This is what the term "double descent" stands for: a second descent of test error, not into the sweet spot between underfitting and overfitting, but to the right of the "overfitting boundary". This is the so-called "interpolation regime", where training loss is zero, but generalization beyond the training data is, counter-intuitively, also pretty good.
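To make that second descent concrete, here is a minimal sketch (my own illustration, not from the post or any particular paper) using random Fourier features fitted by minimum-norm least squares, a standard toy setup for observing double descent; all names and constants are assumptions of the sketch. As the number of random features grows past the number of training points, the test error typically spikes near the interpolation threshold and then comes back down.

import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression task (purely illustrative).
def target(x):
    return np.sin(2 * np.pi * x)

n_train = 20
x_train = rng.uniform(-1.0, 1.0, n_train)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
x_test = np.linspace(-1.0, 1.0, 500)
y_test = target(x_test)

def fit_min_norm(n_features, seed=1):
    """Random Fourier features + minimum-norm least squares (via pinv)."""
    feat_rng = np.random.default_rng(seed)
    w = feat_rng.standard_normal(n_features) * 5.0       # random frequencies
    b = feat_rng.uniform(0.0, 2.0 * np.pi, n_features)   # random phases
    phi = lambda x: np.cos(np.outer(x, w) + b)
    coef = np.linalg.pinv(phi(x_train)) @ y_train        # minimum-norm solution
    train_err = np.mean((phi(x_train) @ coef - y_train) ** 2)
    test_err = np.mean((phi(x_test) @ coef - y_test) ** 2)
    return train_err, test_err

print(f"{'features':>8} {'train MSE':>10} {'test MSE':>10}")
for p in [2, 5, 10, 15, 18, 20, 22, 30, 50, 100, 500, 2000]:
    tr, te = fit_min_norm(p)
    print(f"{p:8d} {tr:10.4f} {te:10.4f}")
# Expected pattern: test MSE tends to spike near p ≈ n_train (the
# interpolation threshold) and then drop again as p grows further --
# the "second descent".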
This is a good way to explain the nice performance of really huge models, but it also suggests that if one has only a tiny bit of training data, moderately-sized (but still large) models might do pretty well. And it turns out that if that tiny bit of training data describes the problem precisely, the model might find the exact overall solution to the problem. This is what the authors of a recent paper called "grokking" (following "Stranger in a Strange Land").
What is interesting here is that software engineering tasks often belong to this class. The training data are the constraints the software must satisfy (for example, tests that should pass), and these might be relatively compact, and then the model in question might be able to derive the desired software (although the "grokking paper" (which I reference in the comments) solves a set of much more narrowly defined mathematical problems, and it remains to be seen how general this approach turns out to be).
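To give a sense of how narrow and how compact those problems are, here is a rough sketch (my own illustration, not the paper's code) of the kind of algorithmic dataset the grokking paper trains on: a complete binary-operation table, such as addition modulo a small prime, with only a fraction of its entries revealed as training data. The modulus and split fraction below are illustrative.

import numpy as np

p = 97                   # small prime modulus (illustrative)
train_fraction = 0.5     # reveal half of the table; the model must fill in the rest

# Every equation "a + b = c (mod p)" in the full operation table.
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

rng = np.random.default_rng(0)
perm = rng.permutation(len(pairs))
n_train = int(train_fraction * len(pairs))
train = [pairs[i] for i in perm[:n_train]]
test = [pairs[i] for i in perm[n_train:]]

print(f"table size: {len(pairs)} equations, "
      f"train: {len(train)}, held out: {len(test)}")
# The training set is tiny but pins down the operation exactly; a network
# trained long enough on it can eventually generalize to the held-out entries.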
no subject
Date: 2021-05-30 10:15 pm (UTC)
https://mathai-iclr.github.io/
MATH-AI: The Role of Mathematical Reasoning in General Artificial Intelligence
All papers (including this one): https://mathai-iclr.github.io/papers/
I'll write more about some of the papers eventually.