6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
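To make this concrete for myself, here is a minimal numeric sketch (my own toy, not the paper's construction) of the identity this line of work is built on: one gradient step on an in-context least-squares loss can be read off as a linear attention computation. All dimensions, the learning rate eta, and the variable names are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 4, 16, 0.1

X = rng.normal(size=(n, d))        # in-context inputs x_1..x_n
W_true = rng.normal(size=(d, d))
Y = X @ W_true.T                   # in-context targets y_i = W_true x_i
x_q = rng.normal(size=d)           # query token

# (a) One explicit gradient step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2,
#     starting from W = 0, then predict for the query.
grad_at_zero = -(Y.T @ X)          # dL/dW evaluated at W = 0
W_one_step = -eta * grad_at_zero   # W = 0 - eta * grad
y_gd = W_one_step @ x_q

# (b) The same prediction written as (linear, unnormalized) attention:
#     the query attends to the context with scores x_i^T x_q and sums
#     value vectors eta * y_i.
scores = X @ x_q                   # unnormalized attention scores
y_attn = (eta * Y).T @ scores

print(np.allclose(y_gd, y_attn))   # True: the two computations coincide

The two branches agree exactly; what these papers report is, roughly, that trained Transformers end up with weights implementing something like branch (b), without anyone asking them to.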
no subject
Date: 2023-09-15 03:33 am (UTC)
https://twitter.com/oswaldjoh/status/1701873029100241241
no subject
Date: 2023-09-15 03:35 am (UTC)
ME: There is this definition of "mesa-optimization": "Mesa-Optimization is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer." I wonder what is the etymology of "mesa" here...
GPT-4: The term "mesa-optimization" appears to be derived from the concept of a "mesa," which is a flat-topped elevation or hill with steep sides. The use of "mesa" in the term "mesa-optimizer" can be understood metaphorically to suggest a "higher level" or "platform" atop the foundational structure of a base optimizer. In this context, the "mesa-optimizer" is a model that is trained by the base optimizer and then acts as an optimizer itself, in essence becoming a higher-level system perched on top of the base optimizer.
The terminology comes from the field of machine learning safety research and aims to conceptualize the issues that arise when an optimizer creates another optimizer. One of the concerns with mesa-optimization is that the objectives of the mesa-optimizer may not align perfectly with those of the base optimizer, leading to unintended and potentially unsafe behavior.
The choice of "mesa-" as a prefix likely serves to evoke the idea of layered or hierarchical structures, suggesting a relationship between the base and mesa-optimizer that reflects this structural elevation or nested architecture.
no subject
Date: 2023-09-15 04:32 am (UTC)

no subject
Date: 2023-09-15 09:06 am (UTC)
Who would have guessed before that gradient descent would be strangely useful in doing "AI".
no subject
Date: 2023-09-15 01:42 pm (UTC)
But this new, strange form of gradient descent is different: the model performs a bit of gradient descent on the fly during a single forward pass, and, moreover, no one programmed it to do that. This capability of "on-the-fly gradient descent" within a single forward pass emerged accidentally in models that happen to have tons of other good properties (Transformers). That's actually quite weird, and it's not much in doubt any longer: many papers over the last few years have looked at Transformers from various angles and found this kind of effect, flavored a bit differently in each case depending on the angle of view of a given study.
And one should start asking: should we put this ability to optimize on the fly during a single forward pass into our models explicitly, rather than merely using what has emerged by accident?
And the authors of this paper are, indeed, starting to do this:
"Motivated by our findings that attention layers are attempting to implicitly optimize internal
objective functions, we introduce the mesa-layer, a novel attention layer that efficiently
solves a least-squares optimization problem, instead of taking just a single gradient step
towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax
self-attention Transformers on simple sequential tasks while offering more interpretability."
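To fix the intuition of "solves a least-squares optimization problem, instead of taking just a single gradient step", here is a hedged toy of such a layer, written from my reading of the abstract rather than from the paper's exact parameterization (the function name mesa_layer_toy and the ridge parameter lam are mine):

import numpy as np

def mesa_layer_toy(keys, values, queries, lam=1e-3):
    """keys and queries: arrays of shape (T, d); values: shape (T, d_v).
    Causal: position t only uses tokens 0..t. Returns outputs of shape (T, d_v)."""
    T, d = keys.shape
    outputs = np.zeros((T, values.shape[1]))
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]
        # W_t = argmin_W  sum_{i<=t} ||W k_i - v_i||^2 + lam * ||W||^2,
        # solved in closed form instead of by a single gradient step
        W_t = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
        outputs[t] = W_t @ queries[t]
    return outputs

As far as I understand, the actual mesa-layer gets to the same kind of result efficiently with recursive updates along the sequence, whereas the toy above recomputes a matrix inverse at every position purely for clarity.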
no subject
Date: 2023-09-15 01:58 pm (UTC)
This is amazing.
no subject
Date: 2023-10-21 01:22 am (UTC)
It was presented a couple of weeks ago at ML Collective, and it has exactly one citing paper, namely "Uncovering mesa-optimization algorithms in Transformers".
Moreover, its "artificial intention" is a least-squares solver (I don't know if it's the same as von Oswald's mesa-layer, but I'll try to figure that out).
One remark they made during the talk was that a randomly initialized untrained model with this kind of layer interpolated some data (e.g. sine curves) nicely...
***
Anyway, I'd like to resume studying this material, starting with this "artificial intention" paper, and making some notes here...
no subject
Date: 2023-10-21 01:24 am (UTC)

no subject
Date: 2023-10-21 04:27 am (UTC)
> One remark they made during the talk was that a randomly initialized untrained model with this kind of layer interpolated some data (e.g. sine curves) nicely...
Yes, this is Section 5.1, page 7
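I haven't reproduced their Section 5.1 setup, but here is a toy (all numbers and names mine) of why this is less surprising than it sounds: a layer that solves least-squares in its forward pass can interpolate a sine from a handful of context points even when its feature projections are random and completely untrained.

import numpy as np

rng = np.random.default_rng(1)
w_feat = rng.normal(size=64)                   # random, untrained projection
b_feat = rng.uniform(0.0, 2.0 * np.pi, size=64)

def feats(x):
    # untrained random cosine features of a scalar input
    return np.cos(np.outer(x, w_feat) + b_feat)

x_ctx = np.linspace(0.0, 2.0 * np.pi, 24)      # context points on a sine curve
y_ctx = np.sin(x_ctx)
x_new = np.linspace(0.0, 2.0 * np.pi, 200)     # points to interpolate

# the "least-squares layer": an exact ridge solve over the context, no training
Phi = feats(x_ctx)
w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(64), Phi.T @ y_ctx)
y_hat = feats(x_new) @ w

print(np.max(np.abs(y_hat - np.sin(x_new))))   # max deviation from the true sine

All the "learning" here happens in the in-context solve, not in the weights, which is presumably the point of that remark.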
no subject
Date: 2023-10-21 04:31 am (UTC)
"We would like to thank Irene, Teresa and Manuel for their eternal patience, Pebbles for its unshakeable enthusiasm, the Foxes for their distant support, Ketjow for nothing, Sorin for making us look good and Matt for all the dancing."
no subject
Date: 2023-10-21 01:12 pm (UTC)
And specifically to their mesa-layer: it's in Section 4, "AN ATTENTION LAYER FOR OPTIMAL LEAST-SQUARES LEARNING", on page 5.
This is really, really poorly written (unlike https://arxiv.org/abs/2305.10203, which is quite understandable). Actually deciphering what it is doing would be a lot of work. It's even difficult to say whether this ends up being a different method from https://arxiv.org/abs/2305.10203, or the same one, just written differently...
But thanks to https://arxiv.org/abs/2305.10203 we know that all this is quite possible and should work OK.
***
The moral is that I'll need to traverse https://arxiv.org/abs/2309.05858 to understand the main elements better without getting bogged down in details.
This text is super-interesting, but it's really poorly written; hopefully someone rewrites it at some point.
no subject
Date: 2023-10-21 01:31 pm (UTC)
Page 1: They reference plenty of studies showing from various angles that Transformer attention layers secretly perform "large steps of gradient descent" on the fly, but they don't mention the pioneering study in this series of informal papers on the "physics of Transformers":
https://mcbal.github.io/
no subject
Date: 2023-10-21 01:38 pm (UTC)
It is particularly important that this is claimed to work even in small models. (Reading Matthias Bal, one would ask whether this might even work for untrained models; most likely the function being implicitly optimized is just too weird in the untrained case, but the mesa-layer defined in this paper, or the one in https://arxiv.org/abs/2305.10203, or other non-standard layers might fix that problem of the in-context objective function being too weird.)
no subject
Date: 2023-10-21 01:41 pm (UTC)
(Linear attention omits softmax.)
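Just to pin that remark down, a toy side-by-side of the two layers (single head, no masking; shapes and names are mine):

import numpy as np

def softmax_attention(Q, K, V):
    # standard scaled dot-product attention
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # the same layer with the softmax dropped (variants apply a feature map
    # to Q and K, or normalize by the score sums; omitted here)
    return (Q @ K.T) @ V

Dropping the softmax makes the output linear in the value vectors, which is why the exact "attention as a gradient step" reading sketched above goes through most cleanly for linear attention.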
no subject
Date: 2023-10-21 01:55 pm (UTC)
It's a great paper, just poorly written; it's a lot of work to parse the details.
But even a superficial level of understanding of this paper can be quite useful, and one can gradually understand the details better as needed...