"Meta-Learning Bidirectional Update Rules"
I am reading this paper: arxiv.org/abs/2104.04657
"In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have neither explicit notion of nor ever receive gradients. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES. Resulting update rules generalize to unseen tasks and train faster than gradient descent based optimizers for several standard computer vision and synthetic tasks."
"In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have neither explicit notion of nor ever receive gradients. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES. Resulting update rules generalize to unseen tasks and train faster than gradient descent based optimizers for several standard computer vision and synthetic tasks."
"We define a space of possible transformations that specify
the interaction between neurons’ feed-forward and feedback
signals. The matrices controlling these interactions
are meta-parameters that are shared across both layers and
tasks. We term these meta-parameters a “genome”. This
reframing opens up a new, more generalized space of neural
networks, allowing the introduction of arbitrary numbers
of states and channels into neurons and synapses, which
have their analogues in biological systems, such as the multiple
types of neurotransmitters, or chemical vs. electrical
synapse transmission.
Our framework, which we call BLUR (Bidirectional
Learned Update Rules) describes a general set of multi-state
update rules that are capable to train networks to learn new
tasks without ever having access to explicit gradients. We
demonstrate that through meta-learning BLUR can learn
effective genomes with just a few training tasks. Such
genomes can be learned using off-the-shelf optimizers or
evolutionary strategies. We show that such genomes can
train networks on unseen tasks faster than comparably sized
gradient networks. The learned genomes can also generalize
to architectures unseen during the meta-training."
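To make the "genome" concrete for myself, a minimal data-structure sketch (my own guess at the shapes, not the paper's code): with K states per neuron/synapse, the genome is just a handful of small KxK matrices, shared by every layer and every task, while the network weights themselves are not part of the genome.

```python
# A hedged guess at the "genome" as a data structure: a few small K x K matrices
# (K = number of states) shared across all layers and all tasks. Field names are
# mine, not the paper's.
from dataclasses import dataclass
import numpy as np

K = 2  # number of states per neuron / synapse

@dataclass
class Genome:
    forward_mix: np.ndarray    # (K, K): how incoming state channels mix on the forward pass
    backward_mix: np.ndarray   # (K, K): how feedback state channels mix on the backward pass
    update_mix: np.ndarray     # (K, K): how pre/post channels pair up in the synapse update

genome = Genome(*np.random.default_rng(0).normal(0, 0.1, (3, K, K)))

# Only these ~3 * K * K numbers are meta-learned; every layer of every inner-loop
# network is updated by the same genome.
```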
"Kirsch & Schmidhuber (2020) propose a generalized learning
algorithm based on a sparsely connected set of RNNs
that, similar to our framework, does not use any gradients
or explicit loss function, yet is able to approximate forward
pass and backpropagation solely from forward activations
of RNNs. Our system, in contrast, does not use RNN activations
and explicitly leaves (meta-parametrized) bidirectional
update rules in place."
It's an important piece of homework to compare in detail the similarities and differences between this paper and Kirsch & Schmidhuber, "Meta Learning Backpropagation And Improving It", https://arxiv.org/abs/2012.14905 (especially given that the approach by Kirsch & Schmidhuber is somewhat related to the meta-learning approach we are proposing to pursue in DMMs: page 2, section 2.2 and page 3, section A.3 of https://www.cs.brandeis.edu/~bukatin/towards-practical-dmms.pdf).
No traces of the source code. (I'd like to understand their "unroll" better: no code is shared, the "unroll" plays quite a significant role, and it is not explained well, so I am not 100% sure I understand it.)
"To learn a new type of neural network we need to formally
define the space of possible configurations. Our proposed
space is a generalization of classical artificial neural networks,
with inspiration drawn from biology. For the purpose
of clarity, in this section we modify the notation by
abstracting from the standard layer structure of a neural
network, and instead assume our network is essentially a
bag-of-neurons N of n neurons with a connectivity structure
defined by two functions: “upstream” neurons I(i) \in N
that send their outputs to i, and the set of “downstream”
neurons J(i) \in N that receive the output of i as one of
their inputs. Thus the synapse weight matrix w_ij can encode
separate weights for forward and backward connections."
I was also thinking this way about "superneurons" in DMMs, but I was not thinking about bidirectional weights (although, of course, both w_ij and w_ji could be non-zero and different). Of course, with "superneurons", if one really wants a layer, one can put it inside a single "superneuron".
(If one looks at page 4, their formalism is a bit of a mess if one needs both w_ij and w_ji to be non-zero in the feed-forward sense. In that case, one would need to duplicate both slots. As written, their formalism only works when w_ij and w_ji are not simultaneously non-zero.)
So, synapse-wise, one effectively has K networks with the same structure but different weights, plus KxK matrices governing how these K networks interact.
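Here is how I currently picture this bag-of-neurons bookkeeping, as a hedged sketch with made-up names and shapes (not the paper's code): K-channel states per neuron, a (K, n, n) weight tensor in which the i->j and j->i entries are independent, and I(i)/J(i) read off from the sparsity pattern.

```python
# A sketch of the bag-of-neurons view with made-up names/shapes (not the paper's
# code): n neurons with K states each, and a (K, n, n) synapse tensor where
# w[k, i, j] is channel k of the connection i -> j. The entries for i -> j and
# j -> i are separate, so weights need not be symmetric; I(i) and J(i) are read
# off from the connectivity mask.
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 2
neuron_states = np.zeros((K, n))            # K states per neuron
mask = rng.random((n, n)) < 0.5             # connectivity structure (is i -> j present?)
w = rng.normal(0, 0.3, (K, n, n)) * mask    # K channels per synapse; w[:, i, j] != w[:, j, i]

def upstream(i):
    """I(i): neurons that send their output to neuron i."""
    return np.nonzero(mask[:, i])[0]

def downstream(i):
    """J(i): neurons that receive the output of neuron i."""
    return np.nonzero(mask[i, :])[0]
```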
"We propose an additive update for both neurons and
synapses. Note that in order to generalize to backpropagation,
an additive update for the backward pass has to
be replaced with a multiplicative one and applied only
to the second state. Experimentally, we discovered
that both additive and multiplicative updates perform
similarly."
So, as written, their particular system does not literally include the case of traditional backpropagation (they also use a backward-pass non-linearity, which does not help in this sense). They seem to say: "who cares".
Formula 7 on page 4 is very interesting; we need to ponder it more.
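Since no code is shared, here is only the general shape of a bidirectional additive Hebb-style update as I currently imagine it; this is emphatically my own reconstruction and not the paper's Formula 7. The KxK channel-mixing matrices below stand in for the genome.

```python
# My hedged reconstruction of the general shape of a bidirectional additive
# Hebb-style rule (NOT the paper's Formula 7): the K x K matrices below stand in
# for the genome and decide how state channels interact.
import numpy as np

rng = np.random.default_rng(0)
K, n_pre, n_post = 2, 4, 3
forward_mix = rng.normal(0, 0.1, (K, K))    # genome: channel mixing, forward direction
backward_mix = rng.normal(0, 0.1, (K, K))   # genome: channel mixing, backward direction
update_mix = rng.normal(0, 0.1, (K, K))     # genome: how pre/post channels pair in updates
w = rng.normal(0, 0.3, (K, n_pre, n_post))  # K-channel synapses of one connection block

def forward_step(pre_states):               # pre_states: (K, n_pre)
    mixed = forward_mix @ pre_states
    return np.tanh(np.einsum('kp,kpq->kq', mixed, w))   # post_states: (K, n_post)

def backward_step(post_states):             # post_states: (K, n_post)
    mixed = backward_mix @ post_states
    return np.tanh(np.einsum('kq,kpq->kp', mixed, w))   # pre_states: (K, n_pre)

def synapse_update(pre_states, post_states, lr=0.01):
    # additive update: genome-weighted outer products of pre and post state channels
    return w + lr * np.einsum('kl,kp,lq->kpq', update_mix, pre_states, post_states)
```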
"In addition to generalizing existing gradient learning, not
relying on gradients in backpropagation has additional benefits.
For example, the network doesn’t need to have an explicit
notion of a final loss function. The feedback (ground
truth) can be fed directly into the last layer (e.g. by an update
to the second state or simply by replacing the second
state altogether) and the backward pass would take care of
backpropagating it through the layers."
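A tiny sketch of the "no explicit loss" point, with my own names and shapes: the ground truth is simply written into (or mixed into) the second state of the output neurons, and the backward pass carries it from there.

```python
# Sketch of feedback injection without a loss function (my names/shapes): write
# the ground truth into the second state of the output neurons and let the
# backward pass propagate it like any other second-state signal.
import numpy as np

K, n_out = 2, 3
output_states = np.zeros((K, n_out))        # state 0: activations, state 1: feedback
y_true = np.array([0.0, 1.0, 0.0])

# option 1: replace the second state with the target altogether
output_states[1] = y_true

# option 2 (alternative): write an error-like signal (target minus activation) into it
output_states[1] = y_true - output_states[0]
```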
"Notice that the genome is defined at the level of individual
neurons and synapses and is independent from the network
architecture. Thus, the same genome can be trained for
different architectures and, more generally, a genome trained
on one architecture can be applied to different
architectures. We show some examples of this in the
experimental section.
Since the proposed framework can use more than two states,
we hypothesize that just as the number of layers relates to
the complexity of learning required for an individual task
(inner loop of the meta-learning), the number of states might
be related to complexity of learning behaviour across the
task (outer loop)."
page 6: using TensorFlow and JAX; between 30 minutes and 20 hours of training on a single GPU to meta-learn the genome
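For my own reference, a hedged sketch of what the CMA-ES outer loop could look like with the off-the-shelf `cma` package; `inner_loop_accuracy` is a hypothetical placeholder standing in for unrolling the genome-driven update rule on a meta-training task and measuring accuracy.

```python
# Hedged sketch of the CMA-ES outer loop via the `cma` package's ask/tell API.
# `inner_loop_accuracy` is a hypothetical placeholder: a real version would unroll
# a few genome-driven update steps on a meta-training task and return accuracy.
import numpy as np
import cma

def inner_loop_accuracy(genome_vector, task_seed):
    # placeholder so the sketch runs end-to-end; a real version would build a
    # network, train it with the genome's update rule, and evaluate it
    return float(np.random.default_rng(task_seed).random())

genome_dim = 3 * 2 * 2                      # e.g. three 2x2 genome matrices, flattened
es = cma.CMAEvolutionStrategy(np.zeros(genome_dim), 0.1)

for generation in range(50):
    candidates = es.ask()
    # fitness to minimize: negative mean accuracy over a few meta-training tasks
    fitnesses = [-np.mean([inner_loop_accuracy(g, s) for s in range(4)])
                 for g in candidates]
    es.tell(candidates, fitnesses)

best_genome = es.result.xbest               # flat genome, reshaped into K x K matrices later
```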
Ablation study on the symmetry of synapses: it's interesting that the setup closest in spirit to backprop and initialized to resemble backprop performs slightly better (although they are all good, and all much better than backprop in these experiments).
Learning genomes for deeper and wider networks (Fig 10)
"genomes generalize from more complex architectures to less complex architectures, but not vice versa!"
*** Things don't depend on whether we can take derivatives! So one can apply this to systems consisting of mixtures of differentiable and non-differentiable components! ***
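A toy illustration of that point (mine, not the paper's): put a hard, non-differentiable non-linearity into the forward pass and a Hebb/delta-style state-based update still applies verbatim, because it never needs a derivative of anything.

```python
# Toy example (mine): a hard, non-differentiable non-linearity in the forward
# pass, with a perceptron/Hebb-style update that only reads states; no chain
# rule through hard_sign is ever needed.
import numpy as np

rng = np.random.default_rng(1)
n_pre, n_post = 4, 3
w = rng.normal(0, 0.3, (n_pre, n_post))

def hard_sign(x):                           # non-differentiable at 0, zero gradient elsewhere
    return np.where(x >= 0, 1.0, -1.0)

pre = rng.normal(size=n_pre)                # state of the upstream neurons
target = np.array([1.0, -1.0, 1.0])         # feedback injected from downstream
post = hard_sign(pre @ w)                   # forward pass through the non-differentiable unit

# state-based update: works regardless of whether the forward map is differentiable
w += 0.01 * np.outer(pre, target - post)
```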
"There are many interesting directions for future exploration.
Perhaps, the most important one is the question of scale.
Here one intriguing direction is the connection between
the number of states and the learning capabilities. Another
possible approach is extending the space of update rules,
such as allowing injection of randomness for robustness, or
providing an ability for neurons to self-regulate based on
current state. Finally the ability to extend existing genomes
to produce ever better learners, might help us scale even
further. Another intriguing direction is incorporating the
weight updates on both forward and backward passes. The
former can be seen as a generalization of unsupervised learning,
thus merging both supervised and unsupervised learning
in one gradient-free framework."
When we see this kind of bootstrap ("eating one's own dog food"), it will be a sign that the field of meta-learning is maturing.
page 14: exploring different types of non-linearities, and different numbers of states per neuron (more states is better)
pages 15-16: further experiments