6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
no subject
Date: 2023-09-15 09:06 am (UTC)
Who would have guessed before that gradient descent would be strangely useful in doing "AI".
no subject
Date: 2023-09-15 01:42 pm (UTC)
But this is a new, strange form of gradient descent: the model performs a bit of gradient descent on the fly during a single forward pass, and, moreover, no one programmed it to do so. This capability of "on-the-fly gradient descent" within a single forward pass emerged accidentally in models that happened to have plenty of other good properties (Transformers). That is actually quite weird, and it is not much in doubt any longer: many papers in the last few years have looked at Transformers from various angles and found this kind of effect, flavored a bit differently in each case depending on the angle of view of a given study.
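To make this concrete, here is a minimal numerical sketch (my own illustration, not code from the paper) of the identity that makes such behavior possible: one gradient-descent step on an in-context least-squares objective, starting from zero weights, yields exactly the same prediction as a linear self-attention readout over the context. The particulars (dimensions, the learning rate eta, noiseless linear data) are assumptions chosen for clarity.

# Sketch: one inner gradient step == a linear self-attention readout.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 4, 32                      # input dim, number of in-context examples
W_true = rng.normal(size=(1, d))      # ground-truth linear map the context encodes

X = rng.normal(size=(n_ctx, d))       # in-context inputs x_1..x_n
Y = X @ W_true.T                      # in-context targets y_i = W_true x_i
x_q = rng.normal(size=(d,))           # query token

eta = 0.1                             # learning rate of the "inner" optimizer

# (1) Explicit inner optimization: one GD step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2,
#     starting from W_0 = 0, then predict on the query.
W_1 = eta * (Y.T @ X)                 # minus the gradient at W_0 is sum_i y_i x_i^T
pred_gd = W_1 @ x_q

# (2) Linear self-attention readout (no softmax) with keys = x_i, values = y_i,
#     query = x_q, scaled by the same eta.
pred_attn = eta * (Y.T @ (X @ x_q))

print(np.allclose(pred_gd, pred_attn))   # True: the two computations coincide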
And one should start asking: should we put this ability to optimize on the fly during a single forward pass into our models explicitly, rather than merely using what has emerged by accident?
And the authors of this paper are, indeed, starting to do this:
"Motivated by our findings that attention layers are attempting to implicitly optimize internal
objective functions, we introduce the mesa-layer, a novel attention layer that efficiently
solves a least-squares optimization problem, instead of taking just a single gradient step
towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax
self-attention Transformers on simple sequential tasks while offering more interpretability."
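And here is a rough conceptual sketch (again my own illustration under assumptions, not the paper's actual mesa-layer) of what "solving the least-squares problem instead of taking a single gradient step" can look like numerically: fit the in-context examples in closed form with ridge regression and read out the prediction on the query. The single gradient step from the previous sketch only moves partway toward this solution.

# Sketch: closed-form in-context least-squares readout (lam and shapes are assumptions).
import numpy as np

def least_squares_readout(X, Y, x_q, lam=1e-3):
    """Fit W* = argmin_W sum_i ||W x_i - y_i||^2 + lam ||W||^2 in closed form,
    then predict on the query token x_q."""
    d = X.shape[1]
    W_star = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))
    return W_star @ x_q

rng = np.random.default_rng(1)
d, n_ctx = 4, 32
W_true = rng.normal(size=(1, d))       # linear map encoded by the context
X = rng.normal(size=(n_ctx, d))        # in-context inputs
Y = X @ W_true.T                       # in-context targets
x_q = rng.normal(size=(d,))            # query token

print(least_squares_readout(X, Y, x_q))   # close to W_true @ x_q
print(W_true @ x_q)                       # the target a single gradient step only approximates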
no subject
Date: 2023-09-15 01:58 pm (UTC)
This is amazing.