It turns out that this is open-access: dl.acm.org/doi/10.1145/3448250
The nuances in that lecture are very interesting and shed light on the disagreement between Hinton et al. and Schmidhuber et al. (this one is written from the Hinton et al. side, obviously; their emphasis is that technical aspects are equally important and not subservient to "pioneering theory"; e.g., a number of fairly recent pre-2012 developments, such as the practical understanding of the role of ReLU, are what made the AlexNet breakthrough possible, and moreover things like "the very efficient use of multiple GPUs by Alex Krizhevsky" are also key, not just the neural architecture ideas).
There is a whole section on Transformers, I am going to include it in the comments verbatim.
The journal publication is July 2021, and there are references in the paper which are newer than 2018; I don't know how heavily the text itself has been edited since 2018.
no subject
Date: 2021-10-27 07:04 pm (UTC)

"A significant development in deep learning, especially when it comes to sequential processing, is the use of multiplicative interactions, particularly in the form of soft attention. This is a transformative addition to the neural net toolbox, in that it changes neural nets from purely vector transformation machines into architectures which can dynamically choose which inputs they operate on, and can store information in differentiable associative memories. A key property of such architectures is that they can effectively operate on different kinds of data structures including sets and graphs.
..."
no subject
Date: 2021-10-27 07:08 pm (UTC)

"Soft attention can be used by modules in a layer to dynamically select which vectors from the previous layer they will combine to compute their outputs. This can serve to make the output independent of the order in which the inputs are presented (treating them as a set) or to use relationships between different inputs (treating them as a graph).
The transformer architecture, which has become the dominant architecture in many applications, stacks many layers of "self-attention" modules. Each module in a layer uses a scalar product to compute the match between its query vector and the key vectors of other modules in that layer. The matches are normalized to sum to 1, and the resulting scalar coefficients are then used to form a convex combination of the value vectors produced by the other modules in the previous layer. The resulting vector forms an input for a module of the next stage of computation. Modules can be made multi-headed so that each module computes several different query, key and value vectors, thus making it possible for each module to have several distinct inputs, each selected from the previous stage modules in a different way. The order and number of modules does not matter in this operation, making it possible to operate on sets of vectors rather than single vectors as in traditional neural networks. For instance, a language translation system, when producing a word in the output sentence, can choose to pay attention to the corresponding group of words in the input sentence, independently of their position in the text. While multiplicative gating is an old idea for such things as coordinate transforms and powerful forms of recurrent networks, its recent forms have made it mainstream. Another way to think about attention mechanisms is that they make it possible to dynamically route information through appropriately selected modules and combine these modules in potentially novel ways for improved out-of-distribution generalization.
..."
no subject
Date: 2021-10-27 07:13 pm (UTC)

"Transformers have produced dramatic performance improvements that have revolutionized natural language processing, and they are now being used routinely in industry. These systems are all pre-trained in a self-supervised manner to predict missing words in a segment of text.
Perhaps more surprisingly, transformers have been used successfully to solve integral and differential equations symbolically. A very promising recent trend uses transformers on top of convolutional nets for object detection and localization in images with state-of-the-art performance. The transformer performs post-processing and object-based reasoning in a differentiable manner, enabling the system to be trained end-to-end.
..."
no subject
Date: 2021-10-27 09:55 pm (UTC)