6 months since GPT-4 release
Sep. 14th, 2023 11:30 pm

A good way to mark this occasion is to try to read a new paper which seems to be a major breakthrough in understanding and harnessing the magic of Transformers:
"Uncovering mesa-optimization algorithms in Transformers"
"Uncovering mesa-optimization algorithms in Transformers"
"we demonstrate that minimizing a generic autoregressive loss gives rise to a subsidiary gradient-based optimization algorithm running inside the forward pass of a Transformer. This phenomenon has been recently termed mesa-optimization"
"Moreover, we find that the resulting mesa-optimization algorithms exhibit in-context few-shot learning capabilities,
independently of model scale. Our results therefore complement previous reports characterizing the
emergence of few-shot learning in large-scale LLMs"
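The core claim is easiest to see in the simplest case discussed in this line of work: one step of gradient descent on an in-context linear-regression loss can be written exactly as an (unnormalized) linear self-attention operation, which is how an optimization algorithm can "run inside the forward pass". Below is a minimal numpy sketch of that equivalence; the dimensions, the learning rate eta, and the zero initialization are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression: the "mesa-objective" is
# L(W) = 0.5 * sum_i ||y_i - W x_i||^2 over the context pairs (x_i, y_i).
d_in, d_out, n_ctx = 4, 2, 16
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))      # context inputs x_i
Y = X @ W_true.T                        # context targets y_i
x_query = rng.normal(size=d_in)         # query token

# One explicit gradient-descent step on L(W), starting from W = 0,
# with an illustrative learning rate eta.
eta = 0.1
grad_at_zero = -(Y.T @ X)               # dL/dW evaluated at W = 0
W_one_step = -eta * grad_at_zero        # = eta * sum_i y_i x_i^T
pred_gd = W_one_step @ x_query

# The same prediction as unnormalized linear self-attention:
# keys k_i = x_i, values v_i = y_i, query q = x_query.
scores = X @ x_query                    # attention scores k_i^T q
pred_attn = eta * (Y.T @ scores)        # sum_i v_i (k_i^T q), rescaled by eta

assert np.allclose(pred_gd, pred_attn)
print("one GD step and linear attention agree:", pred_gd)
```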
no subject
Date: 2023-10-21 01:38 pm (UTC)

It is particularly important that this is claimed to work even in small models. Reading Matthias Bal, one might even ask whether something like this could work in untrained models; most likely the function being optimized in context would just be too weird there. But the mesa-layer defined in this paper, the layer in https://arxiv.org/abs/2305.10203, or other non-standard layers might fix that problem of the in-context goal function being too weird.
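For reference, a minimal numpy sketch of what a mesa-layer-style computation looks like on my reading: instead of taking one implicit gradient step, each position outputs the prediction of a regularized least-squares model fitted to the key/value pairs seen so far. The function name, the explicit matrix inverse, and the regularizer lam are assumptions for illustration; an actual layer would use recursive (Sherman-Morrison-style) updates and learned key/value/query projections.

```python
import numpy as np

def mesa_layer_sketch(keys, values, queries, lam=1.0):
    """At each position t, fit W_t = argmin_W sum_{i<=t} ||W k_i - v_i||^2
    + lam * ||W||_F^2 in closed form, then output W_t @ q_t."""
    T, d_k = keys.shape
    outputs = np.zeros((T, values.shape[1]))
    for t in range(T):
        K, V = keys[: t + 1], values[: t + 1]
        # Ridge-regression solution over the prefix up to and including t.
        W_t = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d_k))
        outputs[t] = W_t @ queries[t]
    return outputs

# Toy usage: values are a fixed linear function of the keys, so later
# positions should predict that function increasingly well.
rng = np.random.default_rng(1)
K = rng.normal(size=(8, 4))
Q = rng.normal(size=(8, 4))
V = K @ rng.normal(size=(4, 3))
print(mesa_layer_sketch(K, V, Q, lam=0.1).shape)  # (8, 3)
```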