Let's now scale it to bring the max magnitude closer (so that we don't have to figure out the best learning rates again):
Use "10*F.softmax(x, dim=1)" instead of "F.softmax(x, dim=1)". I am not sure this changes much, but we'll see...
(Perhaps one does need to play with the learning rate schedule; at the beginning this looked better, but then the "10*" started to look counter-productive.)
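For concreteness, here is roughly where such a change would sit; the module and layer names below are made up for illustration, and only the last line is the actual tweak:

import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    # Hypothetical final layer of the network under discussion.
    def __init__(self, in_features, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        x = self.fc(x)
        # Was: return F.softmax(x, dim=1)
        # Multiply by 10 so the output magnitude stays closer to the raw logits,
        # in the hope that the previously tuned learning rates still apply.
        return 10 * F.softmax(x, dim=1)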
[Epoch 200] Top-1 85.64 Time: 60.00
Script finished in 60.00 minutes, best top-1: 85.77, final top-1: 85.64
no subject
Date: 2021-08-25 05:19 am (UTC)
[Epoch 200] Top-1 88.47 Time: 70.49
Script finished in 70.49 minutes, best top-1: 88.49, final top-1: 88.47
We got a very mild improvement.
no subject
Date: 2021-09-06 03:30 pm (UTC)
Got a much less stable training curve (I think), getting ahead of and falling behind the previous one all the time, but it ended somewhat behind:
[Epoch 200] Top-1 88.12 Time: 58.26
Script finished in 58.26 minutes, best top-1: 88.20, final top-1: 88.12
I am going to rerun (I'd like to check whether the current setup is deterministic and, if not, how large the run-to-run variations are).
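For the determinism check, something like the following is the usual set of knobs in PyTorch; whether the current script already sets any of them is an open question:

import random
import numpy as np
import torch

def make_deterministic(seed=0):
    # Seed every RNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA generators
    # Make cuDNN pick deterministic kernels (usually a bit slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raise an error on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)

Even then, DataLoader workers and a few CUDA ops need extra care (e.g. the CUBLAS_WORKSPACE_CONFIG environment variable), so bit-for-bit reruns are not guaranteed.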
no subject
Date: 2021-09-06 04:58 pm (UTC)
[Epoch 200] Top-1 88.67 Time: 58.07
Script finished in 58.07 minutes, best top-1: 88.67, final top-1: 88.67
But comparing in the presence of this much jitter is a nightmare, unless one configuration is overwhelmingly better.
One might need to do tons of reruns (in parallel, perhaps) to get statistics...
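If one does collect a pile of reruns, even a tiny helper like this makes the jitter easier to judge (the accuracies below are made-up placeholders, just to show the shape):

from statistics import mean, stdev

def summarize(runs):
    # runs: final (or best) top-1 accuracies from repeated runs of one configuration
    m, s = mean(runs), stdev(runs)
    # Rough 95% interval for the mean, assuming roughly normal run-to-run jitter.
    return m, s, 1.96 * s / len(runs) ** 0.5

print(summarize([88.1, 88.7, 88.3, 88.5]))

Two configurations are only worth distinguishing if their intervals barely overlap; otherwise it is just the jitter talking.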