"Mixture-of-Experts" and Transformers
Dec. 16th, 2023 07:18 am
The main difference between classical Transformers and the new generation (which includes GPT-4) is that the new generation seems to use a "mixture-of-experts" architecture, with each feedforward layer subdivided into "experts" and only some of those experts activated on each inference run.
I think this is key to both GPT-4 and Mixtral 8x7B (which I suspect is approximately an open-source mini-GPT-4, and which is now the leading open-source model, roughly equivalent to GPT-3.5 in performance).
Of course, GPT-4 might have some extra secret sauce besides being (according to rumors) a "mixture-of-experts" and besides sheer scale (given how difficult it has been to even reproduce its performance so far).
Hugging Face published a very nice tutorial recently: huggingface.co/blog/moe
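To make the routing idea concrete, here is a minimal sketch of a top-k routed mixture-of-experts feedforward layer in PyTorch. This is an illustration of the general technique, not the actual GPT-4 or Mixtral code; the dimensions and the router design are placeholders chosen for the example.

```python
# Minimal sketch of a sparsely activated MoE feedforward layer:
# several small "expert" FFNs plus a learned router that picks
# the top-k experts for each token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # One small FFN per expert (Mixtral 8x7B uses 8 experts with top-2 routing).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # The router ("gate") scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.router(x)                    # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)           # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            chosen = top_idx[:, slot]              # which expert each token picked in this slot
            for e, expert in enumerate(self.experts):
                token_mask = chosen == e
                if token_mask.any():               # only the selected experts are ever evaluated
                    out[token_mask] += top_w[token_mask, slot].unsqueeze(-1) * expert(x[token_mask])
        return out


# Usage: 10 tokens of width 512; each token only runs 2 of the 8 expert FFNs.
tokens = torch.randn(10, 512)
layer = MoEFeedForward()
print(layer(tokens).shape)                         # torch.Size([10, 512])
```

The point of the routing is that only the selected experts are evaluated for a given token, so the layer can hold far more parameters than it spends compute on per token.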
no subject
Date: 2023-12-16 03:29 pm (UTC)
"Consider a machine learning model (for example, a Transformer) with a dynamically computed mask screening the parts of the model which should currently be inactive. Would it make sense to call such a model "mixture of experts", or not? What would be the reasons for calling it this way, or for avoiding this terminology?"
https://chat.openai.com/share/bebb98d1-aa11-46ad-9311-229ec56f9bba
The answer helped me with my confusion about this distinction.
no subject
Date: 2023-12-16 04:25 pm (UTC)
The distinction sounds pretty reasonable.
no subject
Date: 2023-12-16 04:37 pm (UTC)
I first asked GPT-4 this question:
"I am looking at the Mixture of Experts, e.g. at https://en.wikipedia.org/wiki/Mixture_of_experts and I see that it's mostly about evaluating a part of a model instead of the whole model, and which part to evaluate depends on the input, so this is basically a model together with a dynamically computed mask screening the parts of the model which should currently be inactive. But what I don't quite understand is the intuition behind this terminology. Why do people choose to call a model with a dynamically computed mask screening the parts of the model which should currently be inactive a "mixture of experts"? What is the intuition behind such a strange name for the presence of a dynamically computed mask?"
and the generated explanation was not clarifying...
Then I started a new GPT-4 session and asked this more symmetrically formulated question
"Consider a machine learning model (for example, a Transformer) with a dynamically computed mask screening the parts of the model which should currently be inactive. Would it make sense to call such a model "mixture of experts", or not? What would be the reasons for calling it this way, or for avoiding this terminology?"
and then the generated explanation was good...
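For what it's worth, here is a small, purely illustrative sketch (my own example, not taken from the linked chat or the Wikipedia article) of the relation this question is probing: top-k expert routing can indeed be written as a dense mixture whose mixing weights are zeroed out by a dynamically computed mask. The practical difference is that a real MoE layer never evaluates the masked-out experts, whereas the masked formulation below still runs all of them.

```python
# Top-k routing expressed as a dynamically computed 0/1 mask over
# per-expert mixing weights. Shapes and names are illustrative only.
import torch
import torch.nn.functional as F

n_tokens, n_experts, top_k = 4, 8, 2
router_scores = torch.randn(n_tokens, n_experts)          # produced by a learned router

# Dynamically computed mask: 1 for each token's top-k experts, 0 otherwise.
top_idx = router_scores.topk(top_k, dim=-1).indices
mask = torch.zeros_like(router_scores).scatter_(-1, top_idx, 1.0)

# Masked softmax: identical to taking the softmax over just the top-k scores.
masked_weights = F.softmax(router_scores.masked_fill(mask == 0, float("-inf")), dim=-1)

# Pretend every expert produced an output for every token (dense evaluation).
expert_outputs = torch.randn(n_tokens, n_experts, 16)      # (tokens, experts, d_model)

# "Mask" view of MoE: weight all expert outputs, with most weights equal to zero.
dense_mix = (masked_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

print(dense_mix.shape)                    # torch.Size([4, 16])
print((masked_weights > 0).sum(dim=-1))   # exactly top_k nonzero weights per token
```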