6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
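To make this concrete for myself, here is a minimal numeric sketch (my own toy, not the paper's construction) of the identity this line of work is built on: one gradient step on an in-context least-squares loss can be read off as a linear attention computation. All dimensions, the learning rate eta, and the variable names are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 4, 16, 0.1

X = rng.normal(size=(n, d))        # in-context inputs x_1..x_n
W_true = rng.normal(size=(d, d))
Y = X @ W_true.T                   # in-context targets y_i = W_true x_i
x_q = rng.normal(size=d)           # query token

# (a) One explicit gradient step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2,
#     starting from W = 0, then predict for the query.
grad_at_zero = -(Y.T @ X)          # dL/dW evaluated at W = 0
W_one_step = -eta * grad_at_zero   # W = 0 - eta * grad
y_gd = W_one_step @ x_q

# (b) The same prediction written as (linear, unnormalized) attention:
#     the query attends to the context with scores x_i^T x_q and sums
#     value vectors eta * y_i.
scores = X @ x_q                   # unnormalized attention scores
y_attn = (eta * Y).T @ scores

print(np.allclose(y_gd, y_attn))   # True: the two computations coincide

The two branches agree exactly; what these papers report is, roughly, that trained Transformers end up with weights implementing something like branch (b), without anyone asking them to.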
no subject
Date: 2023-09-15 03:33 am (UTC)
https://twitter.com/oswaldjoh/status/1701873029100241241
no subject
Date: 2023-09-15 03:35 am (UTC)
ME: There is this definition of "mesa-optimization": "Mesa-Optimization is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer." I wonder what is the etymology of "mesa" here...
GPT-4: The term "mesa-optimization" appears to be derived from the concept of a "mesa," which is a flat-topped elevation or hill with steep sides. The use of "mesa" in the term "mesa-optimizer" can be understood metaphorically to suggest a "higher level" or "platform" atop the foundational structure of a base optimizer. In this context, the "mesa-optimizer" is a model that is trained by the base optimizer and then acts as an optimizer itself, in essence becoming a higher-level system perched on top of the base optimizer.
The terminology comes from the field of machine learning safety research and aims to conceptualize the issues that arise when an optimizer creates another optimizer. One of the concerns with mesa-optimization is that the objectives of the mesa-optimizer may not align perfectly with those of the base optimizer, leading to unintended and potentially unsafe behavior.
The choice of "mesa-" as a prefix likely serves to evoke the idea of layered or hierarchical structures, suggesting a relationship between the base and mesa-optimizer that reflects this structural elevation or nested architecture.
no subject
Date: 2023-09-15 04:32 am (UTC)

no subject
Date: 2023-09-15 09:06 am (UTC)
Who would have guessed before that gradient descent would be strangely useful in doing "AI".
no subject
Date: 2023-09-15 01:42 pm (UTC)
But this new, strange form of gradient descent is different: the model performs a bit of gradient descent on the fly during a single forward pass, and, moreover, no one programmed it to do that. This capability of "on-the-fly gradient descent" within a single forward pass emerged accidentally in models that happen to have tons of other good properties (Transformers). That's actually quite weird, and it's not much in doubt any longer: many papers over the last few years have looked at Transformers from various angles and found this kind of effect, flavored a bit differently in each case depending on the angle of view of a given study.
And one should start asking: should we put this ability to optimize on the fly during a single forward pass into our models explicitly, rather than merely using what has emerged by accident?
And the authors of this paper are, indeed, starting to do this:
"Motivated by our findings that attention layers are attempting to implicitly optimize internal
objective functions, we introduce the mesa-layer, a novel attention layer that efficiently
solves a least-squares optimization problem, instead of taking just a single gradient step
towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax
self-attention Transformers on simple sequential tasks while offering more interpretability."
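To fix the intuition of "solves a least-squares optimization problem, instead of taking just a single gradient step", here is a hedged toy of such a layer, written from my reading of the abstract rather than from the paper's exact parameterization (the function name mesa_layer_toy and the ridge parameter lam are mine):

import numpy as np

def mesa_layer_toy(keys, values, queries, lam=1e-3):
    """keys and queries: arrays of shape (T, d); values: shape (T, d_v).
    Causal: position t only uses tokens 0..t. Returns outputs of shape (T, d_v)."""
    T, d = keys.shape
    outputs = np.zeros((T, values.shape[1]))
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]
        # W_t = argmin_W  sum_{i<=t} ||W k_i - v_i||^2 + lam * ||W||^2,
        # solved in closed form instead of by a single gradient step
        W_t = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
        outputs[t] = W_t @ queries[t]
    return outputs

As far as I understand, the actual mesa-layer gets to the same kind of result efficiently with recursive updates along the sequence, whereas the toy above recomputes a matrix inverse at every position purely for clarity.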
no subject
Date: 2023-09-15 01:58 pm (UTC)
This is amazing.
no subject
Date: 2023-10-21 01:22 am (UTC)
It was presented a couple of weeks ago at ML Collective, and it has exactly one citing paper, namely "Uncovering mesa-optimization algorithms in Transformers".
Moreover, its "artificial intention" is a least-squares solver (I don't know if it's the same as von Oswald's mesa-layer, but I'll try to figure that out).
One remark they made during the talk was that a randomly initialized untrained model with this kind of layer interpolated some data (e.g. sine curves) nicely...
***
Anyway, I'd like to resume studying this material, starting with this "artificial intention" paper, and making some notes here...
no subject
Date: 2023-10-21 01:24 am (UTC)

no subject
Date: 2023-10-21 04:27 am (UTC)
> One remark they made during the talk was that a randomly initialized untrained model with this kind of layer interpolated some data (e.g. sine curves) nicely...
Yes, this is Section 5.1, page 7
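I haven't reproduced their Section 5.1 setup, but here is a toy (all numbers and names mine) of why this is less surprising than it sounds: a layer that solves least-squares in its forward pass can interpolate a sine from a handful of context points even when its feature projections are random and completely untrained.

import numpy as np

rng = np.random.default_rng(1)
w_feat = rng.normal(size=64)                   # random, untrained projection
b_feat = rng.uniform(0.0, 2.0 * np.pi, size=64)

def feats(x):
    # untrained random cosine features of a scalar input
    return np.cos(np.outer(x, w_feat) + b_feat)

x_ctx = np.linspace(0.0, 2.0 * np.pi, 24)      # context points on a sine curve
y_ctx = np.sin(x_ctx)
x_new = np.linspace(0.0, 2.0 * np.pi, 200)     # points to interpolate

# the "least-squares layer": an exact ridge solve over the context, no training
Phi = feats(x_ctx)
w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(64), Phi.T @ y_ctx)
y_hat = feats(x_new) @ w

print(np.max(np.abs(y_hat - np.sin(x_new))))   # max deviation from the true sine

All the "learning" here happens in the in-context solve, not in the weights, which is presumably the point of that remark.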
no subject
Date: 2023-10-21 04:31 am (UTC)
"We would like to thank Irene, Teresa and Manuel for their eternal patience, Pebbles for its unshakeable enthusiasm, the Foxes for their distant support, Ketjow for nothing, Sorin for making us look good and Matt for all the dancing."
no subject
Date: 2023-10-21 01:12 pm (UTC)
And specifically to their mesa-layer: it's in Section 4, "AN ATTENTION LAYER FOR OPTIMAL LEAST-SQUARES LEARNING", on page 5.
This is really, really poorly written (unlike https://arxiv.org/abs/2305.10203, which is quite understandable). Actually deciphering what it is doing would be a lot of work. It's even difficult to say whether this ends up being a different method from https://arxiv.org/abs/2305.10203, or the same one, just written differently...
But thanks to https://arxiv.org/abs/2305.10203 we know that all this is quite possible and should work OK.
***
The moral is that I'll need to traverse https://arxiv.org/abs/2309.05858 to understand the main elements better without getting bogged down in details.
This text is super-interesting, but it's really poorly written; hopefully someone rewrites it at some point.
no subject
Date: 2023-10-21 01:31 pm (UTC)
Page 1: They reference plenty of studies showing from various angles that Transformer attention layers secretly perform "large steps of gradient descent" on the fly, but they don't mention the pioneering study in this series of informal papers on the "physics of Transformers":
https://mcbal.github.io/
no subject
Date: 2023-10-21 01:38 pm (UTC)
It is particularly important that this is claimed to work even in small models. (Reading Matthias Bal, one would ask whether this might even work for untrained models; most likely the function being implicitly optimized is just too weird in the untrained case, but the mesa-layer defined in this paper, or the one in https://arxiv.org/abs/2305.10203, or other non-standard layers might fix that problem of the in-context objective function being too weird.)
no subject
Date: 2023-10-21 01:41 pm (UTC)
(Linear attention omits softmax.)
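Just to pin that remark down, a toy side-by-side of the two layers (single head, no masking; shapes and names are mine):

import numpy as np

def softmax_attention(Q, K, V):
    # standard scaled dot-product attention
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # the same layer with the softmax dropped (variants apply a feature map
    # to Q and K, or normalize by the score sums; omitted here)
    return (Q @ K.T) @ V

Dropping the softmax makes the output linear in the value vectors, which is why the exact "attention as a gradient step" reading sketched above goes through most cleanly for linear attention.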
no subject
Date: 2023-10-21 01:55 pm (UTC)
It's a great paper, just poorly written; it's a lot of work to parse the details.
But even a superficial level of understanding of this paper can be quite useful, and one can gradually understand the details better as needed...