The main difference between classical Transformers and the new generation (which includes GPT-4) is that the new generation seems to use "mixtures-of-experts": each feedforward layer is subdivided into "experts", and only some of those experts are activated on each inference run.
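
To make the routing idea concrete, here is a minimal sketch of a top-k routed mixture-of-experts feedforward block in PyTorch. This is illustrative only; the layer sizes, the number of experts, and the top-2 routing are assumptions for the example, not the actual GPT-4 or Mixtral 8x7B implementation.

```python
# Minimal sketch of a top-k routed mixture-of-experts feedforward layer.
# Sizes and routing choices here are illustrative, not GPT-4's or Mixtral's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is an ordinary two-layer feedforward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        # Only the top-k selected experts are evaluated for each token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The point of the sparsity is that with, say, 8 experts and top-2 routing, only a quarter of the feedforward parameters are touched per token, so inference cost grows much more slowly than total parameter count.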

I think this is key to both GPT-4 and Mixtral 8x7B (which I suspect is approximately an open-source mini-GPT-4, and which is the new leading open-source model, roughly equivalent to GPT-3.5 in performance).

Of course, GPT-4 might have some extra magic secret sauce besides its (rumored) mixture-of-experts architecture and its sheer scale, given how difficult it has been even to reproduce its performance so far.

Hugging Face published a very nice tutorial recently: huggingface.co/blog/moe
