Let's now scale it to bring the max magnitude closer (so that we don't have to figure out the best learning rates again):
Use "10*F.softmax(x, dim=1)" instead of "F.softmax(x, dim=1)". I am not sure this changes much, but we'll see...
(Perhaps one does need to play with the learning rate schedule; at the beginning this looked better, but then the "10*" started to look counter-productive.)
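For concreteness, here is roughly where such a change would sit; the module and layer names below are made up for illustration, and only the last line is the actual tweak:

import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    # Hypothetical final layer of the network under discussion.
    def __init__(self, in_features, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        x = self.fc(x)
        # Was: return F.softmax(x, dim=1)
        # Multiply by 10 so the output magnitude stays closer to the raw logits,
        # in the hope that the previously tuned learning rates still apply.
        return 10 * F.softmax(x, dim=1)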
[Epoch 200] Top-1 85.64 Time: 60.00
Script finished in 60.00 minutes, best top-1: 85.77, final top-1: 85.64
no subject
Date: 2021-08-25 05:19 am (UTC)
[Epoch 200] Top-1 88.47 Time: 70.49
Script finished in 70.49 minutes, best top-1: 88.49, final top-1: 88.47
We got a very mild improvement.
no subject
Date: 2021-09-06 03:30 pm (UTC)
Got a much less stable training curve (I think), getting ahead of and falling behind the previous one all the time, but it ended somewhat behind:
[Epoch 200] Top-1 88.12 Time: 58.26
Script finished in 58.26 minutes, best top-1: 88.20, final top-1: 88.12
I am going to rerun (I'd like to check whether the current setup is deterministic and, if not, how large the run-to-run variations are).
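For the determinism check, something like the following is the usual set of knobs in PyTorch; whether the current script already sets any of them is an open question:

import random
import numpy as np
import torch

def make_deterministic(seed=0):
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA generators
    # Make cuDNN pick deterministic kernels (usually a bit slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raise an error on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

Even then, DataLoader workers and a few CUDA ops need extra care (e.g. the CUBLAS_WORKSPACE_CONFIG environment variable), so bit-for-bit reruns are not guaranteed.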
no subject
Date: 2021-09-06 04:58 pm (UTC)
[Epoch 200] Top-1 88.67 Time: 58.07
Script finished in 58.07 minutes, best top-1: 88.67, final top-1: 88.67
But comparing in the presence of this much jitter is a nightmare, unless one configuration is overwhelmingly better.
One might need to do tons of reruns (in parallel, perhaps) to get statistics...
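If one does collect a pile of reruns, even a tiny helper like this makes the jitter easier to judge (the accuracies below are made-up placeholders, just to show the shape):

from statistics import mean, stdev

def summarize(runs):
    # runs: final (or best) top-1 accuracies from repeated runs of one configuration
    m, s = mean(runs), stdev(runs)
    # Rough 95% interval for the mean, assuming roughly normal run-to-run jitter.
    return m, s, 1.96 * s / len(runs) ** 0.5

print(summarize([88.1, 88.7, 88.3, 88.5]))

Two configurations are only worth distinguishing if their intervals barely overlap; otherwise it is just the jitter talking.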