A STUDY ON ATTENTION IN MACHINE LEARNING
Attention is studied across disciplines, including neuroscience, psychology & especially Machine Learning (ML). Although each field has its own explanation of attention, there is one central quality we can agree on: attention is a mechanism for making both artificial & biological neural systems more flexible.
This blog focuses on what the concept of attention means to different disciplines & how attention has reshaped Machine Learning in the natural language processing (NLP) & computer vision domains.
The Idea Behind ‘Attention’
The concept of attention originates in psychology. Scientific research on attention began there, where careful behavioral experiments can give rise to precise demonstrations of the abilities & tendencies of attention in various circumstances. Observations from such studies help scientists infer the mental processes underlying those patterns.
Attention in Machine Learning
The idea of attention in Machine Learning is inspired by the psychological mechanisms of the biological brain. Attention mechanisms in artificial neural networks arose for much the same reason attention appears to be needed in the brain: to make neural systems more flexible. In ML, these mechanisms allow a single trained artificial neural network (ANN) to perform well on numerous tasks with inputs of variable structure, length, or size.
While its spirit is undoubtedly inspired by psychology, its implementations do not closely follow biological attention. Attention mechanisms typically operate within an encoder-decoder framework & in the context of sequence models. The encoder's role is to generate a vector representation of the input, & the decoder's task is to transform this representation into an output. The attention mechanism links the two.
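To make the link between encoder & decoder concrete, here is a minimal sketch of one attention step in plain Python. It assumes dot-product scoring, which is only one of several possible scoring functions, & the names `attend`, `encoder_vectors` & `decoder_state` are hypothetical, chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_vectors):
    """Return (weights, context) for a single decoder step."""
    # Score each encoder vector by its dot product with the decoder state.
    scores = [sum(q * k for q, k in zip(decoder_state, vec))
              for vec in encoder_vectors]
    weights = softmax(scores)
    # The context is the weighted average of the encoder vectors.
    dim = len(encoder_vectors[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, encoder_vectors))
               for i in range(dim)]
    return weights, context

# Toy values (assumed): three encoder vectors & one decoder state.
encoder_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_vectors)
```

The weights show how strongly the decoder "looks at" each input position, & the context vector is what it actually reads at that step.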
Attention in Natural Language Processing
An early application of attention in natural language processing was machine translation: turning an input in a source language into an output in a target language. The encoder creates a set of context vectors & the decoder reads these vectors to generate an output. One of the earliest works in machine translation to address the bottleneck created by fixed-length vectors was Bahdanau et al. (2014), which employed Recurrent Neural Networks (RNNs) for both encoding & decoding.
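The scoring used in this style of attention is usually called additive attention: the decoder state & each encoder vector are projected, summed, passed through tanh & reduced to a single score. The sketch below illustrates that idea in plain Python. The parameters `W_s`, `W_h` & `v` are untrained toy values assumed for illustration, & the function name is hypothetical, not the authors' code.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(state, annotations, W_s, W_h, v):
    """Additive scoring: score_i = v . tanh(W_s state + W_h annotation_i)."""
    projected_state = matvec(W_s, state)
    scores = []
    for h in annotations:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over the source positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder annotations.
    context = [sum(w * h[i] for w, h in zip(weights, annotations))
               for i in range(len(annotations[0]))]
    return weights, context

# Toy, untrained parameters (assumed values for illustration only).
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.4, 0.0], [-0.3, 0.6]]
v = [1.0, -1.0]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one vector per source word
state = [0.2, 0.7]                                   # previous decoder state
weights, context = additive_attention(state, annotations, W_s, W_h, v)
```

Because a fresh context vector is computed at every decoder step, the model no longer has to squeeze the whole source sentence into one fixed-length vector.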
Another influential encoder-decoder approach from the same period was Sutskever et al. (2014), who used a multi-layered long short-term memory (LSTM) network to encode the input sequence into a vector & a second LSTM to decode that vector into the target sequence. A later work by Vaswani et al. (2017) introduced an entirely different design, the Transformer, which relies on attention rather than recurrence & steered machine translation in a new direction.
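The core operation of the Vaswani et al. design is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The following is a rough plain-Python sketch of that formula applied as self-attention. As a simplifying assumption, the queries, keys & values are the raw token vectors themselves; a real model would first pass them through learned projections & use several attention heads, both omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        outputs.append([sum(w * val[i] for w, val in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy token vectors (assumed). Self-attention: Q, K & V come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every token attends to every other token in a single step, so no recurrence over the sequence is needed.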
Attention in Computer Vision
Attention in computer vision has found its way into numerous applications such as image segmentation, classification & captioning. Suppose, for instance, that the encoder-decoder model is reframed for image captioning. In that case, the encoder can be a convolutional neural network (CNN) that captures the salient visual features of the image in a vector representation, & the decoder can be an RNN or LSTM that transforms the representation into an output.
In the neuroscience literature, these attentional processes can be divided into spatial & feature-based attention. An influential attention-based work in computer vision was proposed by Dosovitskiy et al. (2020), who applied the Vision Transformer (ViT) to image classification tasks. Arnab et al. (2021) extended the model to ViViT, exploiting the spatiotemporal information contained in videos. The ViT has since been applied to various other domains such as image generation, action localization & gaze estimation.
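A ViT can reuse sequence-style attention because it treats an image as a sequence of fixed-size patches, each flattened into a vector. The sketch below shows only that patch-splitting step in plain Python, assuming a single-channel image stored as a list of rows. The function name `image_to_patches` is hypothetical, & the learned linear projection & position embeddings that follow in a real ViT are omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 grayscale image (assumed values) split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
```

Each flattened patch then plays the role that a word vector plays in NLP, so the same attention operation can be applied to the resulting sequence.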