6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
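To make the quoted claim a bit more concrete, here is a toy NumPy sketch (my own notation and toy setup, not the paper's code) of what "an optimization algorithm running inside the forward pass" means: the prompt itself defines a small least-squares problem, and a few "inner" gradient steps on that problem already fit the unseen linear task given in-context, with no outer training at all.

```python
# Toy illustration (not the paper's code) of mesa-optimization as I read the quote:
# the prompt defines a least-squares problem, and an optimizer running "inside the
# forward pass" can solve it with a handful of gradient steps.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 4, 16

W_task = rng.normal(size=(1, d))          # unseen linear "task" behind this prompt
X = rng.normal(size=(n_ctx, d))           # in-context inputs x_j
y = X @ W_task.T                          # in-context targets y_j
x_query = rng.normal(size=(d, 1))         # query token

W = np.zeros((1, d))                      # mesa-parameters, start from scratch
eta = 0.2
for _ in range(100):                      # "inner" gradient descent on the prompt
    grad = (W @ X.T - y.T) @ X / n_ctx    # gradient of 0.5 * mean squared error
    W -= eta * grad

print("true value :", (W_task @ x_query).item())
print("mesa fit   :", (W @ x_query).item())   # already close to the true value
```

As I read it, the paper's claim is that trained Transformers end up implementing something like this inner loop inside their attention layers.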
no subject
Date: 2023-10-21 01:12 pm (UTC)
And specifically to their mesa-layer, in Section 4, "AN ATTENTION LAYER FOR OPTIMAL LEAST-SQUARES LEARNING", on page 5.
This is really, really poorly written (unlike https://arxiv.org/abs/2305.10203, which is quite understandable). Actually deciphering what it is doing would be a lot of work. It's even difficult to say whether this ends up being a different method from https://arxiv.org/abs/2305.10203 or the same one, just written differently...
But thanks to https://arxiv.org/abs/2305.10203 we know that all this is quite possible and should work OK.
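For what it's worth, here is my best guess at what the mesa-layer is doing, written as the most naive NumPy sketch (the variable names and the ridge regularizer lam are mine, and the actual layer in the paper is surely implemented more cleverly than re-solving the least-squares problem at every position):

```python
# A hedged sketch of my reading of Section 4: at every position t, fit a ridge
# regression from keys to values over the positions seen so far, in closed form,
# then apply that fitted map to the current query.
import numpy as np

def mesa_layer(K, V, Q, lam=1.0):
    """K, V, Q: (T, d) arrays of keys, values, queries for one head."""
    T, d = K.shape
    out = np.zeros_like(V)
    for t in range(T):                    # causal: use only positions <= t
        Kt, Vt = K[:t + 1], V[:t + 1]
        # W_t = argmin_W  sum_{j<=t} ||W k_j - v_j||^2  +  lam * ||W||_F^2
        W_t = np.linalg.solve(Kt.T @ Kt + lam * np.eye(d), Kt.T @ Vt).T
        out[t] = W_t @ Q[t]               # apply the fitted map to the query
    return out

rng = np.random.default_rng(1)
T, d = 8, 4
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))
print(mesa_layer(K, V, Q).shape)          # (8, 4)
```

If this reading is right, it is essentially per-position ridge regression over the key/value pairs seen so far, which would also explain why it feels so close to https://arxiv.org/abs/2305.10203.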
***
The moral is that I'll need to traverse https://arxiv.org/abs/2309.05858 to understand the main elements better without getting bogged down in details.
This text is super-interesting, but it's really poorly written; hopefully someone will rewrite it at some point.
no subject
Date: 2023-10-21 01:31 pm (UTC)
Page 1: They reference plenty of studies showing, from various angles, that Transformer attention layers secretly perform large gradient-descent steps on the fly, but they don't mention the pioneering work in this series of informal papers on the "physics of Transformers":
https://mcbal.github.io/
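The simplest version of that claim can be checked numerically in a few lines (my own toy, not any specific paper's construction): with the values set to the in-context targets, the keys set to the inputs, and no softmax, the attention readout at the query coincides with the prediction after one gradient-descent step on the in-context least-squares loss.

```python
# Minimal numerical check: one linear-attention readout over the context equals
# the prediction of one gradient-descent step on the in-context least-squares
# problem, starting from W = 0.  (Toy construction of my own, for illustration.)
import numpy as np

rng = np.random.default_rng(2)
d, n_ctx, eta = 3, 10, 0.05
X = rng.normal(size=(n_ctx, d))           # in-context inputs x_j
y = rng.normal(size=(n_ctx, 1))           # in-context targets y_j
x_q = rng.normal(size=(d, 1))             # query input

# One explicit GD step on L(W) = 0.5 * sum_j ||W x_j - y_j||^2, from W = 0:
W1 = eta * (y.T @ X)                      # shape (1, d)
pred_gd = W1 @ x_q

# The same number written as softmax-free attention over the context:
# value_j = y_j, key_j = x_j, query = x_q, unnormalized scores x_j^T x_q.
pred_attn = eta * (y.T @ (X @ x_q))

print(pred_gd.item(), pred_attn.item())   # identical up to floating-point round-off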
no subject
Date: 2023-10-21 01:38 pm (UTC)
It is particularly important that this is claimed to work even in small models. Reading Matthias Bal, one might even ask whether it could work in untrained models; most likely, the function being optimized in-context is just too weird in the untrained case (though the mesa-layer defined in this paper, or the one in https://arxiv.org/abs/2305.10203, or other non-standard layers might fix that problem of the in-context goal function being too weird).
no subject
Date: 2023-10-21 01:41 pm (UTC)
(Linear attention omits softmax.)
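Side by side, the distinction is just this (a generic sketch, not this paper's exact parameterization, and linear attention shown in its simplest unnormalized form):

```python
# Softmax attention normalizes the scores; linear attention uses the raw dot
# products directly.
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # no softmax; (Q @ K.T) @ V == Q @ (K.T @ V) by associativity
    return (Q @ K.T) @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Dropping the softmax is what makes the output linear in the values and lets (Q K^T) V be regrouped as Q (K^T V), which is part of why these layers lend themselves to the regression readings above.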
no subject
Date: 2023-10-21 01:55 pm (UTC)
It's a great paper, just poorly written; it's a lot of work to parse the details.
But even a superficial level of understanding of this paper can be quite useful, and one can gradually understand the details better as needed...