no subject
Date: 2021-08-23 08:32 pm (UTC)
With line 207 modified to

    x = torch.matmul(F.softmax(self.attention_pool(x), dim=1).transpose(-1, -2), F.softmax(x, dim=0)).squeeze(-2)

the first run is worse than the baseline:
[Epoch 200] Top-1 84.56 Time: 58.91
Script finished in 58.91 minutes, best top-1: 84.61, final top-1: 84.56
versus baseline
[Epoch 200] Top-1 88.34 Time: 64.41
Script finished in 64.41 minutes, best top-1: 88.34, final top-1: 88.34
Now it's time to ponder this initial negative result.
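For reference, here is a self-contained sketch of the baseline sequence pooling next to the modified version. The tensor sizes are made up, and the baseline line is my reading of the repo's classifier head rather than a verbatim copy:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    batch, seq_len, dim = 4, 64, 128                 # hypothetical sizes
    x = torch.randn(batch, seq_len, dim)             # encoder output
    attention_pool = nn.Linear(dim, 1)               # per-token score, as in the classifier head

    w = F.softmax(attention_pool(x), dim=1)          # (batch, seq_len, 1), weights over tokens

    # Baseline sequence pooling: weighted sum of the raw token embeddings.
    baseline = torch.matmul(w.transpose(-1, -2), x).squeeze(-2)                    # (batch, dim)

    # Modification tried here: pass the embeddings through softmax(dim=0) first.
    modified = torch.matmul(w.transpose(-1, -2), F.softmax(x, dim=0)).squeeze(-2)  # (batch, dim)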
no subject
Date: 2021-08-24 10:54 pm (UTC)
https://github.com/SHI-Labs/Compact-Transformers/blob/main/src/utils/transformers.py
Time to understand this better.
***
Yes, these are 3D tensors, and this is a batched matrix multiplication. So my formula is not applicable as written; I need to change it.
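A quick shape check (sizes are again made up) makes the issue concrete: with x of shape (batch, seq_len, dim), softmax over dim=0 normalizes each feature across the batch, so one sample's pooled vector depends on the other samples in the batch, while dim=1 normalizes over the tokens within a sample.

    import torch
    import torch.nn.functional as F

    x = torch.randn(4, 64, 128)                      # (batch, seq_len, dim)
    print(F.softmax(x, dim=0).sum(dim=0)[0, 0])      # ~1.0: sums to one across the batch
    print(F.softmax(x, dim=1).sum(dim=1)[0, 0])      # ~1.0: sums to one across the tokens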
no subject
Date: 2021-08-25 12:01 am (UTC)
[Epoch 200] Top-1 86.99 Time: 60.12
Script finished in 60.12 minutes, best top-1: 87.10, final top-1: 86.99
versus baseline
[Epoch 200] Top-1 88.34 Time: 64.41
Script finished in 64.41 minutes, best top-1: 88.34, final top-1: 88.34
no subject
Date: 2021-08-25 01:54 am (UTC)
Let's now scale it to bring the max magnitude closer (so that we don't have to figure out the best learning rates again): use "10*F.softmax(x, dim=1)" instead of "F.softmax(x, dim=1)". I am not sure this changes much, but we'll see...
(Perhaps one does need to play with the learning rate schedule; at the beginning this looked better, but then the "10*" started to look counter-productive.)
[Epoch 200] Top-1 85.64 Time: 60.00
Script finished in 60.00 minutes, best top-1: 85.77, final top-1: 85.64
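If I read the substitution right, the scaled variant of the pooling line would look like the following; this is my reconstruction with made-up sizes, not a line from the repo:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(4, 64, 128)                      # (batch, seq_len, dim), hypothetical
    attention_pool = nn.Linear(128, 1)

    w = F.softmax(attention_pool(x), dim=1)          # (batch, seq_len, 1)
    # Scale the softmax-ed embeddings by 10 to bring their magnitude closer to the raw x.
    pooled = torch.matmul(w.transpose(-1, -2), 10 * F.softmax(x, dim=1)).squeeze(-2)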
no subject
Date: 2021-08-25 05:19 am (UTC)
[Epoch 200] Top-1 88.47 Time: 70.49
Script finished in 70.49 minutes, best top-1: 88.49, final top-1: 88.47
We got a very mild improvement.
no subject
Date: 2021-09-06 03:30 pm (UTC)
Got a much less stable training curve (I think), pulling ahead of and falling behind the previous one all the time, but it ended somewhat behind:
[Epoch 200] Top-1 88.12 Time: 58.26
Script finished in 58.26 minutes, best top-1: 88.20, final top-1: 88.12
I am going to rerun (I'd like to check whether the current setup is deterministic, and if not, how large the run-to-run variation is).
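For the determinism check, the usual PyTorch knobs are the ones below; this is a generic sketch, not something taken from the training script (data-loader worker seeding and any inherently non-deterministic CUDA ops would also need attention):

    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 0) -> None:
        # Seed every RNG the run might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade cuDNN autotuning speed for reproducible kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False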
no subject
Date: 2021-09-06 03:34 pm (UTC)

no subject
Date: 2021-09-06 04:58 pm (UTC)
[Epoch 200] Top-1 88.67 Time: 58.07
Script finished in 58.07 minutes, best top-1: 88.67, final top-1: 88.67
But comparing in the presence of this much jitter is a nightmare, unless one configuration is overwhelmingly better.
One might need to do tons of reruns (in parallel, perhaps) to get statistics...
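Once several reruns per configuration exist, even a crude mean-and-spread summary helps decide whether a gap is real; a minimal sketch with purely made-up numbers:

    import statistics

    # Hypothetical best top-1 accuracies from repeated runs of two configurations.
    runs = {
        "baseline": [88.3, 88.1, 88.6],
        "variant":  [88.5, 88.2, 88.4],
    }

    for name, accs in runs.items():
        print(f"{name}: mean {statistics.mean(accs):.2f} "
              f"+/- {statistics.stdev(accs):.2f} over {len(accs)} runs")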