The main difference between classical Transformers and the new generation of models (which includes GPT-4) is that the new generation seems to use "mixture-of-experts" architectures: each feedforward layer is subdivided into "experts", and only some of those experts are activated on each inference run.
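
To make the routing idea concrete, here is a minimal sketch of such a sparse feedforward layer with top-k routing, written in PyTorch. The class name, the layer sizes, and the choice of 8 experts with 2 active per token are illustrative assumptions on my part (they happen to match the Mixtral-style setup described in the Hugging Face post linked below), not anything confirmed about GPT-4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """One sparse mixture-of-experts feedforward layer with top-k routing (illustrative sizes)."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is an ordinary two-layer feedforward block.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                           # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the best k experts per token
        top_w = F.softmax(top_w, dim=-1)                  # mixture weights over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Gather the tokens routed to expert e; experts that received no tokens never run.
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


# Usage: 10 tokens of width 512 go in, 10 come out,
# but each token only touched 2 of the 8 experts.
layer = MoEFeedForward()
y = layer(torch.randn(10, 512))
print(y.shape)  # torch.Size([10, 512])
```

The point of the per-expert loop is that an expert the router never selects for any token in the batch simply does not run, which is why inference is cheaper than in a dense model with the same total parameter count.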

I think this is key to both GPT-4 and Mixtral 8x7B (which I suspect is approximately an open-source mini-GPT-4, and which is the new leading open-source model, roughly equivalent to GPT-3.5 in performance).

Of course, GPT-4 might have some extra magic secret sauce besides its scale and besides being (according to rumors) a "mixture-of-experts", given how difficult it has been to even reproduce its performance so far.

Hugging Face published a very nice tutorial recently: huggingface.co/blog/moe

Date: 2023-12-16 04:25 pm (UTC)
From: [personal profile] juan_gandhi

The distinction sounds pretty reasonable.
