6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm
A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
no subject
Date: 2023-09-15 09:06 am (UTC)
Who would have guessed before that gradient descent would be strangely useful in doing "AI".
no subject
Date: 2023-09-15 01:42 pm (UTC)
But this is a new, strange form of gradient descent: the model performs a bit of gradient descent on the fly during a single forward pass, and, moreover, no one programmed it to do so. This capability of "on-the-fly gradient descent" within a single forward pass emerged accidentally in models that happened to have plenty of other good properties (Transformers). That is actually quite weird, and it is not much in doubt any longer: many papers in the last few years have looked at Transformers from various angles and found this kind of effect, flavored a bit differently in each case depending on the angle of view of a given study.
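To make this concrete, here is a minimal numerical sketch (my own illustration, not code from the paper) of the identity that makes such behavior possible: one gradient-descent step on an in-context least-squares objective, starting from zero weights, yields exactly the same prediction as a linear self-attention readout over the context. The particulars (dimensions, the learning rate eta, noiseless linear data) are assumptions chosen for clarity.

# Sketch: one inner gradient step == a linear self-attention readout.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 4, 32                      # input dim, number of in-context examples
W_true = rng.normal(size=(1, d))      # ground-truth linear map the context encodes

X = rng.normal(size=(n_ctx, d))       # in-context inputs x_1..x_n
Y = X @ W_true.T                      # in-context targets y_i = W_true x_i
x_q = rng.normal(size=(d,))           # query token

eta = 0.1                             # learning rate of the "inner" optimizer

# (1) Explicit inner optimization: one GD step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2,
#     starting from W_0 = 0, then predict on the query.
W_1 = eta * (Y.T @ X)                 # minus the gradient at W_0 is sum_i y_i x_i^T
pred_gd = W_1 @ x_q

# (2) Linear self-attention readout (no softmax) with keys = x_i, values = y_i,
#     query = x_q, scaled by the same eta.
pred_attn = eta * (Y.T @ (X @ x_q))

print(np.allclose(pred_gd, pred_attn))   # True: the two computations coincide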
And one should start asking: should we put this ability to optimize on the fly during a single forward pass into our models explicitly, rather than merely using what has emerged by accident?
And the authors of this paper are, indeed, starting to do this:
"Motivated by our findings that attention layers are attempting to implicitly optimize internal
objective functions, we introduce the mesa-layer, a novel attention layer that efficiently
solves a least-squares optimization problem, instead of taking just a single gradient step
towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax
self-attention Transformers on simple sequential tasks while offering more interpretability."
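And here is a rough conceptual sketch (again my own illustration under assumptions, not the paper's actual mesa-layer) of what "solving the least-squares problem instead of taking a single gradient step" can look like numerically: fit the in-context examples in closed form with ridge regression and read out the prediction on the query. The single gradient step from the previous sketch only moves partway toward this solution.

# Sketch: closed-form in-context least-squares readout (lam and shapes are assumptions).
import numpy as np

def least_squares_readout(X, Y, x_q, lam=1e-3):
    """Fit W* = argmin_W sum_i ||W x_i - y_i||^2 + lam ||W||^2 in closed form,
    then predict on the query token x_q."""
    d = X.shape[1]
    W_star = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))
    return W_star @ x_q

rng = np.random.default_rng(1)
d, n_ctx = 4, 32
W_true = rng.normal(size=(1, d))       # linear map encoded by the context
X = rng.normal(size=(n_ctx, d))        # in-context inputs
Y = X @ W_true.T                       # in-context targets
x_q = rng.normal(size=(d,))            # query token

print(least_squares_readout(X, Y, x_q))   # close to W_true @ x_q
print(W_true @ x_q)                       # the target a single gradient step only approximates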
no subject
Date: 2023-09-15 01:58 pm (UTC)
This is amazing.