6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
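To make the quoted claim a bit more concrete, here is a toy NumPy sketch (my own notation and toy setup, not the paper's code) of what "an optimization algorithm running inside the forward pass" means: the prompt itself defines a small least-squares problem, and a few "inner" gradient steps on that problem already fit the unseen linear task given in-context, with no outer training at all.

```python
# Toy illustration (not the paper's code) of mesa-optimization as I read the quote:
# the prompt defines a least-squares problem, and an optimizer running "inside the
# forward pass" can solve it with a handful of gradient steps.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 4, 16

W_task = rng.normal(size=(1, d))          # unseen linear "task" behind this prompt
X = rng.normal(size=(n_ctx, d))           # in-context inputs x_j
y = X @ W_task.T                          # in-context targets y_j
x_query = rng.normal(size=(d, 1))         # query token

W = np.zeros((1, d))                      # mesa-parameters, start from scratch
eta = 0.2
for _ in range(100):                      # "inner" gradient descent on the prompt
    grad = (W @ X.T - y.T) @ X / n_ctx    # gradient of 0.5 * mean squared error
    W -= eta * grad

print("true value :", (W_task @ x_query).item())
print("mesa fit   :", (W @ x_query).item())   # already close to the true value
```

As I read it, the paper's claim is that trained Transformers end up implementing something like this inner loop inside their attention layers.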
no subject
Date: 2023-10-21 01:12 pm (UTC)
And specifically to their mesa-layer, in Section 4, "AN ATTENTION LAYER FOR OPTIMAL LEAST-SQUARES LEARNING", on page 5.
This is really, really poorly written (unlike https://arxiv.org/abs/2305.10203, which is quite understandable). Actually deciphering what it is doing would be a lot of work. It's even difficult to say whether this ends up being a different method from https://arxiv.org/abs/2305.10203 or the same one, just written differently...
But thanks to https://arxiv.org/abs/2305.10203 we know that all this is quite possible and should work OK.
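For what it's worth, here is my best guess at what the mesa-layer is doing, written as the most naive NumPy sketch (the variable names and the ridge regularizer lam are mine, and the actual layer in the paper is surely implemented more cleverly than re-solving the least-squares problem at every position):

```python
# A hedged sketch of my reading of Section 4: at every position t, fit a ridge
# regression from keys to values over the positions seen so far, in closed form,
# then apply that fitted map to the current query.
import numpy as np

def mesa_layer(K, V, Q, lam=1.0):
    """K, V, Q: (T, d) arrays of keys, values, queries for one head."""
    T, d = K.shape
    out = np.zeros_like(V)
    for t in range(T):                    # causal: use only positions <= t
        Kt, Vt = K[:t + 1], V[:t + 1]
        # W_t = argmin_W  sum_{j<=t} ||W k_j - v_j||^2  +  lam * ||W||_F^2
        W_t = np.linalg.solve(Kt.T @ Kt + lam * np.eye(d), Kt.T @ Vt).T
        out[t] = W_t @ Q[t]               # apply the fitted map to the query
    return out

rng = np.random.default_rng(1)
T, d = 8, 4
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))
print(mesa_layer(K, V, Q).shape)          # (8, 4)
```

If this reading is right, it is essentially per-position ridge regression over the key/value pairs seen so far, which would also explain why it feels so close to https://arxiv.org/abs/2305.10203.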
***
The moral is that I'll need to traverse https://arxiv.org/abs/2309.05858 to understand the main elements better without getting bogged down in details.
This text is super-interesting, but it's really poorly written; hopefully someone will rewrite it at some point.
no subject
Date: 2023-10-21 01:31 pm (UTC)
Page 1: They reference plenty of studies showing, from various angles, that Transformer attention layers secretly perform large gradient-descent steps on the fly, but they don't mention the pioneering work in this series of informal papers on the "physics of Transformers":
https://mcbal.github.io/
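The simplest version of that claim can be checked numerically in a few lines (my own toy, not any specific paper's construction): with the values set to the in-context targets, the keys set to the inputs, and no softmax, the attention readout at the query coincides with the prediction after one gradient-descent step on the in-context least-squares loss.

```python
# Minimal numerical check: one linear-attention readout over the context equals
# the prediction of one gradient-descent step on the in-context least-squares
# problem, starting from W = 0.  (Toy construction of my own, for illustration.)
import numpy as np

rng = np.random.default_rng(2)
d, n_ctx, eta = 3, 10, 0.05
X = rng.normal(size=(n_ctx, d))           # in-context inputs x_j
y = rng.normal(size=(n_ctx, 1))           # in-context targets y_j
x_q = rng.normal(size=(d, 1))             # query input

# One explicit GD step on L(W) = 0.5 * sum_j ||W x_j - y_j||^2, from W = 0:
W1 = eta * (y.T @ X)                      # shape (1, d)
pred_gd = W1 @ x_q

# The same number written as softmax-free attention over the context:
# value_j = y_j, key_j = x_j, query = x_q, unnormalized scores x_j^T x_q.
pred_attn = eta * (y.T @ (X @ x_q))

print(pred_gd.item(), pred_attn.item())   # identical up to floating-point round-off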
no subject
Date: 2023-10-21 01:38 pm (UTC)
It is particularly important that this is claimed to work even in small models. Reading Matthias Bal, one might even ask whether it could work in untrained models; most likely, the function being optimized in-context is just too weird in the untrained case (though the mesa-layer defined in this paper, or the one in https://arxiv.org/abs/2305.10203, or other non-standard layers might fix that problem of the in-context goal function being too weird).
no subject
Date: 2023-10-21 01:41 pm (UTC)
(Linear attention omits softmax.)
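Side by side, the distinction is just this (a generic sketch, not this paper's exact parameterization, and linear attention shown in its simplest unnormalized form):

```python
# Softmax attention normalizes the scores; linear attention uses the raw dot
# products directly.
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # no softmax; (Q @ K.T) @ V == Q @ (K.T @ V) by associativity
    return (Q @ K.T) @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Dropping the softmax is what makes the output linear in the values and lets (Q K^T) V be regrouped as Q (K^T V), which is part of why these layers lend themselves to the regression readings above.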
no subject
Date: 2023-10-21 01:55 pm (UTC)
It's a great paper, just poorly written; it's a lot of work to parse the details.
But even a superficial level of understanding of this paper can be quite useful, and one can gradually understand the details better as needed...