"Mixture-of-Experts" and Transformers
Dec. 16th, 2023 07:18 am
The main difference between classical Transformers and the new generation (which includes GPT-4) is that the new generation seems to use a "mixture-of-experts" architecture, with each feedforward layer subdivided into "experts" and only some of those experts activated on each inference run.
I think this is key to both GPT-4 and Mixtral 8x7B (which I suspect is approximately an open-source mini-GPT-4, and which is now the leading open-source model, roughly equivalent to GPT-3.5 in performance).
Of course, GPT-4 might have some extra secret sauce besides being (according to rumors) a "mixture-of-experts" and besides sheer scale (given how difficult it has been to even reproduce its performance so far).
Hugging Face published a very nice tutorial recently: huggingface.co/blog/moe
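To make the routing idea concrete, here is a minimal sketch of a top-k routed mixture-of-experts feedforward layer in PyTorch. This is an illustration of the general technique, not the actual GPT-4 or Mixtral code; the dimensions and the router design are placeholders chosen for the example.

```python
# Minimal sketch of a sparsely activated MoE feedforward layer:
# several small "expert" FFNs plus a learned router that picks
# the top-k experts for each token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # One small FFN per expert (Mixtral 8x7B uses 8 experts with top-2 routing).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # The router ("gate") scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.router(x)                    # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)           # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            chosen = top_idx[:, slot]              # which expert each token picked in this slot
            for e, expert in enumerate(self.experts):
                token_mask = chosen == e
                if token_mask.any():               # only the selected experts are ever evaluated
                    out[token_mask] += top_w[token_mask, slot].unsqueeze(-1) * expert(x[token_mask])
        return out


# Usage: 10 tokens of width 512; each token only runs 2 of the 8 expert FFNs.
tokens = torch.randn(10, 512)
layer = MoEFeedForward()
print(layer(tokens).shape)                         # torch.Size([10, 512])
```

The point of the routing is that only the selected experts are evaluated for a given token, so the layer can hold far more parameters than it spends compute on per token.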
no subject
Date: 2023-12-16 03:29 pm (UTC)
"Consider a machine learning model (for example, a Transformer) with a dynamically computed mask screening the parts of the model which should currently be inactive. Would it make sense to call such a model "mixture of experts", or not? What would be the reasons for calling it this way, or for avoiding this terminology?"
https://chat.openai.com/share/bebb98d1-aa11-46ad-9311-229ec56f9bba
The answer helped me with my confusion about this distinction.
no subject
Date: 2023-12-16 04:25 pm (UTC)
The distinction sounds pretty reasonable.
no subject
Date: 2023-12-16 04:37 pm (UTC)
I first asked GPT-4 this question:
"I am looking at the Mixture of Experts, e.g. at https://en.wikipedia.org/wiki/Mixture_of_experts and I see that it's mostly about evaluating a part of a model instead of the whole model, and which part to evaluate depends on the input, so this is basically a model together with a dynamically computed mask screening the parts of the model which should currently be inactive. But what I don't quite understand is the intuition behind this terminology. Why do people choose to call a model with a dynamically computed mask screening the parts of the model which should currently be inactive a "mixture of experts"? What is the intuition behind such a strange name for the presence of a dynamically computed mask?"
and the generated explanation was not clarifying...
Then I started a new GPT-4 session and asked this more symmetrically formulated question
"Consider a machine learning model (for example, a Transformer) with a dynamically computed mask screening the parts of the model which should currently be inactive. Would it make sense to call such a model "mixture of experts", or not? What would be the reasons for calling it this way, or for avoiding this terminology?"
and then the generated explanation was good...
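For what it's worth, here is a small, purely illustrative sketch (my own example, not taken from the linked chat or the Wikipedia article) of the relation this question is probing: top-k expert routing can indeed be written as a dense mixture whose mixing weights are zeroed out by a dynamically computed mask. The practical difference is that a real MoE layer never evaluates the masked-out experts, whereas the masked formulation below still runs all of them.

```python
# Top-k routing expressed as a dynamically computed 0/1 mask over
# per-expert mixing weights. Shapes and names are illustrative only.
import torch
import torch.nn.functional as F

n_tokens, n_experts, top_k = 4, 8, 2
router_scores = torch.randn(n_tokens, n_experts)          # produced by a learned router

# Dynamically computed mask: 1 for each token's top-k experts, 0 otherwise.
top_idx = router_scores.topk(top_k, dim=-1).indices
mask = torch.zeros_like(router_scores).scatter_(-1, top_idx, 1.0)

# Masked softmax: identical to taking the softmax over just the top-k scores.
masked_weights = F.softmax(router_scores.masked_fill(mask == 0, float("-inf")), dim=-1)

# Pretend every expert produced an output for every token (dense evaluation).
expert_outputs = torch.randn(n_tokens, n_experts, 16)      # (tokens, experts, d_model)

# "Mask" view of MoE: weight all expert outputs, with most weights equal to zero.
dense_mix = (masked_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

print(dense_mix.shape)                    # torch.Size([4, 16])
print((masked_weights > 0).sum(dim=-1))   # exactly top_k nonzero weights per token
```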