https://arxiv.org/abs/1801.06105

A remarkable recent Swiss paper finding a simple solution to the vanishing gradients problem in recurrent networks.

It is a very simple scheme, and it is one of those cases where the question "how come this was not known for decades?" arises. (Other cases where this question arises include AlphaZero (both Go and Chess) and our own self-modifying neural nets based on vector flows.)

I don't think this is a particularly well written paper. What the authors say is that if one rewrites the recurrent part H_next = ... + V*H_previous as H_next = ... + (U+I)*H_previous, where U and V are square matrices and I is the identity matrix, then this "encourages the network to stay close to the identity transformation", and then things work nicely, with the added remarkable benefit of making it possible to use ReLU activation functions in the recurrent setting without things blowing up. But they don't do a good job of explaining why this rewriting encourages the network to stay close to the identity transformation.
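
To make the rewriting concrete, here is a minimal NumPy sketch of one recurrent step in both parametrizations (the function and variable names are mine, not the paper's); the two express the same family of maps, they just name the recurrent matrix differently:

    import numpy as np

    def step_plain(x, h_prev, W, V, b):
        # standard recurrent update: h_next = ReLU(W x + V h_prev + b)
        return np.maximum(0.0, W @ x + V @ h_prev + b)

    def step_identity(x, h_prev, W, U, b):
        # the paper's rewriting: V = U + I, so
        # h_next = ReLU(W x + (U + I) h_prev + b);
        # when U is close to zero, the recurrent map is close to the
        # identity and the hidden state passes through almost unchanged
        I = np.eye(len(h_prev))
        return np.maximum(0.0, W @ x + (U + I) @ h_prev + b)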

I think the answer is regularization, especially explicit regularization on weights like L_2, but possibly also the implicit regularization which might be present in some optimization methods. If a regularization encouraging small weights is applied to the elements of U rather than to the elements of V, then this would indeed encourage the network to stay close to the identity! (When one scales this kind of network to a large data set, one probably needs to make sure that the regularization (which is often associated with priors) does not become vanishingly small compared to the influence of the data; otherwise this approach might stop working.)
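
Continuing the sketch above (this illustrates my regularization hypothesis, not anything stated in the paper): since V = U + I, an L_2 penalty on the elements of U is the same as an L_2 penalty on V - I, so its gradient pushes the effective recurrent matrix toward the identity rather than toward zero:

    import numpy as np

    def l2_penalty_on_V(V, lam):
        # ordinary weight decay: the gradient 2*lam*V shrinks V toward zero,
        # which is exactly the regime where recurrent gradients vanish
        return lam * np.sum(V ** 2)

    def l2_penalty_on_U(U, lam):
        # the same decay applied to U = V - I: the gradient 2*lam*U shrinks U
        # toward zero, i.e. pulls the effective matrix V = U + I toward I
        return lam * np.sum(U ** 2)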

(Other than leaving the reader with a sense of mystery about why it all works, the paper is quite interesting and remarkable, both in its results and in documenting how the authors made the discovery. I certainly don't mean to diminish the value of their discovery here.)

Crosspost: https://anhinga-travel.livejournal.com/18925.html