no subject
Date: 2021-08-23 08:32 pm (UTC)
With line 207 modified to

    x = torch.matmul(F.softmax(self.attention_pool(x), dim=1).transpose(-1, -2), F.softmax(x, dim=0)).squeeze(-2)

the first run is worse than the baseline:
[Epoch 200] Top-1 84.56 Time: 58.91
Script finished in 58.91 minutes, best top-1: 84.61, final top-1: 84.56
versus baseline
[Epoch 200] Top-1 88.34 Time: 64.41
Script finished in 64.41 minutes, best top-1: 88.34, final top-1: 88.34
Now it's time to ponder this initial negative result.
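For reference, here is a self-contained sketch of the baseline sequence pooling next to the modified version. The tensor sizes are made up, and the baseline line is my reading of the repo's classifier head rather than a verbatim copy:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    batch, seq_len, dim = 4, 64, 128                 # hypothetical sizes
    x = torch.randn(batch, seq_len, dim)             # encoder output
    attention_pool = nn.Linear(dim, 1)               # per-token score, as in the classifier head

    w = F.softmax(attention_pool(x), dim=1)          # (batch, seq_len, 1), weights over tokens

    # Baseline sequence pooling: weighted sum of the raw token embeddings.
    baseline = torch.matmul(w.transpose(-1, -2), x).squeeze(-2)                    # (batch, dim)

    # Modification tried here: pass the embeddings through softmax(dim=0) first.
    modified = torch.matmul(w.transpose(-1, -2), F.softmax(x, dim=0)).squeeze(-2)  # (batch, dim)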
no subject
Date: 2021-08-24 10:54 pm (UTC)
https://github.com/SHI-Labs/Compact-Transformers/blob/main/src/utils/transformers.py
Time to understand this better.
***
Yes, these are 3D tensors, and this is a batched matrix multiplication. So my formula is not applicable as written; I need to change it.
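A quick shape check (sizes are again made up) makes the issue concrete: with x of shape (batch, seq_len, dim), softmax over dim=0 normalizes each feature across the batch, so one sample's pooled vector depends on the other samples in the batch, while dim=1 normalizes over the tokens within a sample.

    import torch
    import torch.nn.functional as F

    x = torch.randn(4, 64, 128)                      # (batch, seq_len, dim)
    print(F.softmax(x, dim=0).sum(dim=0)[0, 0])      # ~1.0: sums to one across the batch
    print(F.softmax(x, dim=1).sum(dim=1)[0, 0])      # ~1.0: sums to one across the tokens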
no subject
Date: 2021-08-25 12:01 am (UTC)
[Epoch 200] Top-1 86.99 Time: 60.12
Script finished in 60.12 minutes, best top-1: 87.10, final top-1: 86.99
versus baseline
[Epoch 200] Top-1 88.34 Time: 64.41
Script finished in 64.41 minutes, best top-1: 88.34, final top-1: 88.34
no subject
Date: 2021-08-25 01:54 am (UTC)
Let's now scale it to bring the max magnitude closer (so that we don't have to figure out the best learning rates again): use "10*F.softmax(x, dim=1)" instead of "F.softmax(x, dim=1)". I am not sure this changes much, but we'll see...
(Perhaps one does need to play with the learning rate schedule; at the beginning this looked better, but then the "10*" started to look counter-productive.)
[Epoch 200] Top-1 85.64 Time: 60.00
Script finished in 60.00 minutes, best top-1: 85.77, final top-1: 85.64
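If I read the substitution right, the scaled variant of the pooling line would look like the following; this is my reconstruction with made-up sizes, not a line from the repo:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(4, 64, 128)                      # (batch, seq_len, dim), hypothetical
    attention_pool = nn.Linear(128, 1)

    w = F.softmax(attention_pool(x), dim=1)          # (batch, seq_len, 1)
    # Scale the softmax-ed embeddings by 10 to bring their magnitude closer to the raw x.
    pooled = torch.matmul(w.transpose(-1, -2), 10 * F.softmax(x, dim=1)).squeeze(-2)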
no subject
Date: 2021-08-25 05:19 am (UTC)
[Epoch 200] Top-1 88.47 Time: 70.49
Script finished in 70.49 minutes, best top-1: 88.49, final top-1: 88.47
We got a very mild improvement.
no subject
Date: 2021-09-06 03:30 pm (UTC)
Got a much less stable training curve (I think), pulling ahead of and falling behind the previous one all the time, but it ended somewhat behind:
[Epoch 200] Top-1 88.12 Time: 58.26
Script finished in 58.26 minutes, best top-1: 88.20, final top-1: 88.12
I am going to rerun (I'd like to check whether the current setup is deterministic, and if not, how large the run-to-run variation is).
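For the determinism check, the usual PyTorch knobs are the ones below; this is a generic sketch, not something taken from the training script (data-loader worker seeding and any inherently non-deterministic CUDA ops would also need attention):

    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 0) -> None:
        # Seed every RNG the run might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade cuDNN autotuning speed for reproducible kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False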
no subject
Date: 2021-09-06 03:34 pm (UTC)

no subject
Date: 2021-09-06 04:58 pm (UTC)
[Epoch 200] Top-1 88.67 Time: 58.07
Script finished in 58.07 minutes, best top-1: 88.67, final top-1: 88.67
But comparing in the presence of this much jitter is a nightmare, unless one configuration is overwhelmingly better.
One might need to do tons of reruns (in parallel, perhaps) to get statistics...
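Once several reruns per configuration exist, even a crude mean-and-spread summary helps decide whether a gap is real; a minimal sketch with purely made-up numbers:

    import statistics

    # Hypothetical best top-1 accuracies from repeated runs of two configurations.
    runs = {
        "baseline": [88.3, 88.1, 88.6],
        "variant":  [88.5, 88.2, 88.4],
    }

    for name, accs in runs.items():
        print(f"{name}: mean {statistics.mean(accs):.2f} "
              f"+/- {statistics.stdev(accs):.2f} over {len(accs)} runs")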