Attention (machine learning)

In machine learning, attention is a method that mimics human attention by assigning varying levels of importance to different components of a sequence. In natural language processing, this usually means assigning different levels of importance to the words in a sentence. Importance is assigned by calculating "soft" weights for each word's numerical representation, known as its embedding, within a specific section of the sentence called the context window. These weights can be calculated simultaneously, as in transformers, or one by one, as in recurrent neural networks. Unlike "hard" weights, which are determined during training and fixed afterwards, "soft" weights are recomputed for each input to the model.

Attention was developed to address the weaknesses of relying on the hidden layers of recurrent neural networks to carry information. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information from earlier in the sentence tends to be attenuated. Attention gives the calculation of a token's hidden representation equal, direct access to any part of the sentence, rather than access only through the previous hidden state.

Early uses attached this mechanism to a serial recurrent neural network for language translation (below), but later uses in transformer-based large language models removed the recurrent neural network and relied heavily on the faster, parallel attention scheme.

History

See ^[1]^[2] for reviews.

Predecessors

Human selective attention has been studied in neuroscience and cognitive psychology.^[3]

Selective attention in audition was studied through the cocktail party effect (Colin Cherry, 1953).^[4] Donald Broadbent (1958) proposed the filter model of attention.^[5] Selective attention in vision was studied in the 1960s using George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve all of the visual field at once. Saccade control allows the eye to quickly scan important features of a scene.^[6]

This research inspired algorithms such as a variant of the Neocognitron.^[7]^[8] Conversely, developments in neural networks inspired circuit models of biological visual attention.^[9]^[2] As an example, one well-cited network from 1998 was inspired by the low-level primate visual system. It produced saliency maps of images using handcrafted (not learned) features, which were then used to guide another neural network to process patches of the image in order of decreasing saliency.^[10]

A key aspect of the attention mechanism can be written (schematically) as $\sum _{i}\langle ({\text{query}})_{i},({\text{key}})_{i}\rangle ({\text{value}})_{i}$ where the angled brackets denote the dot product. This shows that it involves a multiplicative operation. Multiplicative operations within neural networks had been studied under the names of higher-order neural networks,^[11] multiplication units,^[12] sigma-pi units,^[13] fast weight controllers,^[14] and hyper-networks.^[15]

Recurrent attention

During the deep learning era, the attention mechanism was developed to solve similar problems in encoding-decoding.^[1]

In machine translation, the seq2seq model, as it was proposed in 2014,^[16] would encode an input text into a fixed-length vector, which would then be decoded into an output text. If the input text is long, the fixed-length vector would be unable to carry enough information for accurate decoding. An attention mechanism was proposed to solve this problem.

An image captioning model that would encode an input image into a fixed-length vector was proposed in 2015, citing inspiration from the seq2seq model.^[17] Xu et al. (2015),^[18] citing Bahdanau et al. (2014),^[19] applied the attention mechanism as used in the seq2seq model to image captioning.

Transformer

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable, as both the encoder and the decoder process the sequence token by token. Decomposable attention^[20] attempted to solve this problem by processing the input sequence in parallel before computing a "soft alignment matrix" ("alignment" is the terminology used by Bahdanau et al. (2014)^[19]).

The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, for example in differentiable neural computers^[21] and neural Turing machines.^[22] It was termed intra-attention^[23] in a model where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were combined in the Transformer architecture, published in Attention Is All You Need (2017). Subsequently, attention mechanisms were extended within the framework of the Transformer architecture.

Dot-product attention

Dot-product attention, popularized by the Transformer architecture, is the most widely used attention mechanism. It can be constructed one module at a time.

For notational simplicity, all vectors are row vectors.

seq2seq machine translation

Consider the seq2seq English-to-French translation task. To be concrete, consider the translation of "the zone of international control <end>", which should translate to "la zone de contrôle international <end>". Here, the special <end> token is a control character that delimits the end of input for both the encoder and the decoder.

An input sequence of text $x_{0},x_{1},\dots$ is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors $h_{0},h_{1},\dots$ , where $h$ stands for "hidden vector".

After the encoder has finished processing, the decoder starts operating over the hidden vectors, to produce an output sequence $y_{0},y_{1},\dots$ , autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder, and what the decoder itself has produced before, to produce the next output word:

( $h_{0},h_{1},\dots$ , "<start>") → "la"
( $h_{0},h_{1},\dots$ , "<start> la") → "la zone"
( $h_{0},h_{1},\dots$ , "<start> la zone") → "la zone de"
...
( $h_{0},h_{1},\dots$ , "<start> la zone de contrôle international") → "la zone de contrôle international <end>"

Here, we use the special <start> token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "<end>" appears in the decoder output.
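The autoregressive loop described above can be sketched as follows. This is a toy illustration under loud assumptions: `toy_decoder_step` and the `NEXT_WORD` table are hypothetical stand-ins for a trained decoder, and the stand-in ignores the encoder's hidden vectors entirely. Only the control flow (feed back what has been produced so far, stop at "<end>") reflects the text.

```python
# Toy sketch of greedy autoregressive decoding.
# NEXT_WORD is a hypothetical lookup table standing in for a trained decoder.
NEXT_WORD = {
    "<start>": "la",
    "la": "zone",
    "zone": "de",
    "de": "contrôle",
    "contrôle": "international",
    "international": "<end>",
}


def toy_decoder_step(hidden_vectors, produced):
    """Return the next output word given encoder states and the output so far.

    A real decoder would use `hidden_vectors`; this stand-in only looks at
    the last produced token.
    """
    return NEXT_WORD[produced[-1]]


def greedy_decode(hidden_vectors, max_len=20):
    produced = ["<start>"]
    while len(produced) < max_len:
        word = toy_decoder_step(hidden_vectors, produced)
        produced.append(word)
        if word == "<end>":  # decoding terminates at the first <end>
            break
    return produced[1:]  # drop the <start> control token


hidden = [[0.0] for _ in range(6)]  # placeholder for h_0, h_1, ... (unused here)
output = greedy_decode(hidden)
print(" ".join(output))
```

The `max_len` guard is an added safety limit so that a decoder that never emits "<end>" cannot loop forever.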

Alignment

The idea of the attention mechanism is that if one has limited computing power, such as a person looking at an image, or a decoder decoding a text, one should spend it on the part where it matters. This idea, applied to seq2seq machine translation, is implemented in the cross-attention mechanism.

To begin, consider how the decoder might translate the word "la" when given ( $h_{0},h_{1},\dots$ , "<start>"). Since the word "la" is a direct translation of the English word "the", the decoder should "focus its attention" on the vector $h_{0}$ , optionally with a little attention diverted to other vectors. So, schematically, we can construct a "context vector" $c_{0}$ as a weighted sum of the hidden vectors: $c_{0}=w_{00}h_{0}+w_{01}h_{1}+\cdots$ where we might, for example, hand-craft the weights $w_{00}=0.8,w_{01}=0.2,w_{02}=0.0,\dots$ . These weights are called attention weights.

The weighting is also called "alignment", a term that comes from the natural language processing task of "word alignment": identifying and aligning words or phrases in one language that are equivalent in meaning to words or phrases in another language, typically within a parallel corpus of translated texts. The previous state of the art was the IBM alignment models. "Alignment" was the term used by Bahdanau et al. (2014).^[19] For example, for "the zone of international control <end>" → "la zone de contrôle international <end>", we should have the following alignments:
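As a minimal sketch of the weighted sum above, the following computes a context vector from hand-crafted weights 0.8, 0.2 and 0.0 on the first three hidden vectors. The two-dimensional hidden vectors are made-up values chosen only so the arithmetic is easy to follow.

```python
# Hypothetical 2-dimensional encoder hidden vectors h_0, h_1, h_2.
hidden = [
    [1.0, 0.0],  # h_0
    [0.0, 1.0],  # h_1
    [5.0, 5.0],  # h_2
]
weights = [0.8, 0.2, 0.0]  # hand-crafted attention weights on h_0, h_1, h_2


def weighted_sum(weights, vectors):
    """Return sum_j weights[j] * vectors[j], computed per component."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


c0 = weighted_sum(weights, hidden)
print(c0)
```

Because the weight on h_2 is zero, its large entries do not affect the context vector at all; the result is 0.8 times h_0 plus 0.2 times h_1.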

"the" - "la"
"zone" - "zone"
...

Alignment is not trivial, because different languages have different word-order. For this example, "international control" is inverted to "contrôle international". As another example, consider "I love you <end>" → "je t' aime <end>", which we can schematically write as the following attention weight matrix:

	I	love	you
je	0.94	0.02	0.04
t'	0.11	0.01	0.88
aime	0.03	0.95	0.02

Here, "aime" is aligned to "love", inverting the order. This is visible in the matrix: in the last two rows, the largest weights lie off the top-left to bottom-right diagonal. Sometimes, alignment can be many-to-many. For example, the English phrase "look it up" corresponds to "cherchez-le". Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there is no one best hidden vector to use.

Attention

As hand-crafting weights defeats the purpose of machine learning, the model must compute the attention weights on its own. Borrowing from the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query and obtains a reply in the form of a weighted sum of the values, where each weight reflects how closely the query resembles the corresponding key.

The decoder first processes the "<start>" input partially, to obtain an intermediate vector $h_{0}^{d}$ , the 0th hidden vector of decoder. Then, the intermediate vector is transformed by a linear map $W^{Q}$ into a query vector $q_{0}=h_{0}^{d}W^{Q}$ . Meanwhile, the hidden vectors outputted by the encoder are transformed by another linear map $W^{K}$ into key vectors $k_{0}=h_{0}W^{K},k_{1}=h_{1}W^{K},\dots$ . The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.

Now, the query and keys are compared by taking dot products: $q_{0}k_{0}^{T},q_{0}k_{1}^{T},\dots$ . Ideally, the model should have learned to compute the queries and keys such that $q_{0}k_{0}^{T}$ is large, $q_{0}k_{1}^{T}$ is small, and the rest are very small. This can be interpreted as saying that the attention weight should be mostly applied to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest.

In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over $0,1,\dots$ . This can be accomplished by the softmax function, thus giving us the attention weights: $(w_{00},w_{01},\dots )=\mathrm {softmax} (q_{0}k_{0}^{T},q_{0}k_{1}^{T},\dots )$ This is then used to compute the context vector: $c_{0}=w_{00}v_{0}+w_{01}v_{1}+\cdots$ where $v_{0}=h_{0}W^{V},v_{1}=h_{1}W^{V},\dots$ are the value vectors, linearly transformed by another matrix to provide the model with freedom to find the best way to represent values. Without the matrices $W^{Q},W^{K},W^{V}$ , the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.
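The step from dot products to attention weights to a context vector can be sketched as below. The scores and value vectors are made-up numbers, not the output of a trained model, and subtracting the maximum before exponentiating is a standard numerical-stability choice that does not change the result.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical dot products q_0 k_0^T, q_0 k_1^T, q_0 k_2^T.
scores = [4.0, 2.0, -1.0]
w = softmax(scores)  # attention weights w_00, w_01, w_02

# Hypothetical value vectors v_0, v_1, v_2.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Context vector c_0 = w_00 v_0 + w_01 v_1 + w_02 v_2.
c0 = [sum(wj * v[d] for wj, v in zip(w, values)) for d in range(2)]

print([round(x, 3) for x in w])
print([round(x, 3) for x in c0])
```

The weights are positive, sum to 1, and preserve the ordering of the scores, which is what allows them to be read as a probability distribution over encoder positions.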

This is the dot-product attention mechanism. The particular version described in this section is "decoder cross-attention": the output context vector is used by the decoder, and the keys and values come from the encoder, but the query comes from the decoder, thus "cross-attention".

More succinctly, we can write it as $c_{0}=\mathrm {Attention} (h_{0}^{d}W^{Q},HW^{K},HW^{V})=\mathrm {softmax} ((h_{0}^{d}W^{Q})\;(HW^{K})^{T})(HW^{V})$ where the matrix $H$ is the matrix whose rows are $h_{0},h_{1},\dots$ .
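The compact formula can be written out as a small function. This is a sketch using plain Python lists as matrices (rows are vectors, matching the row-vector convention used here); the helper names and the identity weight matrices are illustrative assumptions, chosen so the result can be checked by hand.

```python
import math


def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(Q, K, V):
    """softmax(Q K^T) V with a row-wise softmax (unscaled, as in this section)."""
    return matmul(softmax_rows(matmul(Q, transpose(K))), V)


# Hypothetical example: three encoder states of size 2 and one decoder state.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # rows are h_0, h_1, h_2
h_d0 = [[2.0, 0.0]]                       # decoder hidden vector as a 1x2 row
W_Q = [[1.0, 0.0], [0.0, 1.0]]            # identity maps, chosen for readability
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

c0 = attention(matmul(h_d0, W_Q), matmul(H, W_K), matmul(H, W_V))
print(c0)
```

With these values the scores are (2, 0, 2), so h_0 and h_2 receive equal weight and h_1 receives much less.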

Multi-headed attention

The decoder cross-attention can become "multi-headed" if there are several weight matrices $W^{Q,i},W^{K,i},W^{V,i}$ , and we compute one context vector per weight-matrix-triple: $c_{0}^{i}=\mathrm {Attention} (h_{0}^{d}W^{Q,i},HW^{K,i},HW^{V,i})$ then concatenate them, and apply another linear transformation to obtain a final context vector: $c_{0}=\mathrm {Concat} (c_{0}^{1},c_{0}^{2},\dots )W^{O}$
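A minimal sketch of the multi-headed construction follows, again with plain lists as matrices. The two heads and their projection matrices are hypothetical: each head projects onto a single coordinate so that the per-head context vectors are easy to verify, and the output matrix is the identity.

```python
import math


def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def attention(Q, K, V):
    """softmax(Q K^T) V with a row-wise softmax."""
    K_T = [list(col) for col in zip(*K)]
    weights = []
    for row in matmul(Q, K_T):
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, V)


def multi_head_cross_attention(h_d, H, heads, W_O):
    """Concatenate one context vector per (W_Q, W_K, W_V) triple, then apply W_O."""
    contexts = [
        attention(matmul(h_d, W_Q), matmul(H, W_K), matmul(H, W_V))
        for W_Q, W_K, W_V in heads
    ]
    concat = [[x for c in contexts for x in c[0]]]  # 1 x (sum of head sizes)
    return matmul(concat, W_O)


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder hidden vectors as rows
h_d0 = [[1.0, 1.0]]                       # decoder hidden vector as a 1x2 row
first = [[1.0], [0.0]]                    # projects onto the first coordinate
second = [[0.0], [1.0]]                   # projects onto the second coordinate
heads = [(first, first, first), (second, second, second)]
W_O = [[1.0, 0.0], [0.0, 1.0]]            # identity output map, for readability

c0 = multi_head_cross_attention(h_d0, H, heads, W_O)
print(c0)
```

Each head computes its own attention weights, so different heads can focus on different positions of the same encoder output.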

Self-attention

Self-attention is essentially the same as cross-attention, except that query, key, and value vectors all come from the same model. Both encoder and decoder can use self-attention, but with subtle differences. For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_{0},h_{1},\dots$ . These can then be applied to a dot-product attention mechanism, to obtain ${\begin{aligned}h_{0}'&=\mathrm {Attention} (h_{0}W^{Q},HW^{K},HW^{V})\\h_{1}'&=\mathrm {Attention} (h_{1}W^{Q},HW^{K},HW^{V})\\&\cdots \end{aligned}}$ or more succinctly, $H'=\mathrm {Attention} (HW^{Q},HW^{K},HW^{V})$ . This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.

For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij}=0$ for all $i<j$ , called "causal masking". This attention mechanism is the "causally masked self-attention".
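One common way to force those weights to zero, shown here as a sketch, is to replace the masked scores with negative infinity before the softmax, so that their exponentials vanish. The score matrix is made up for illustration.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax with entries j > i masked out (weight exactly 0)."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions j <= i.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 3x3 matrix of query-key dot products.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.0, 3.0],
    [0.2, 0.4, 0.6],
]
W = causal_attention_weights(scores)
for row in W:
    print([round(w, 3) for w in row])
```

The first row always becomes (1, 0, 0): the first position can attend only to itself, however large the masked scores were.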

Core calculations

The attention network was designed to identify the highest correlations amongst words within a sentence, assuming that it has learned those patterns from the training corpus. This correlation is captured in neuronal weights through backpropagation, either from self-supervised pretraining or supervised fine-tuning.

The example below (an encoder-only QKV variant of an attention network) shows how correlations are identified once a network has been trained and has the right weights. When looking at the word "that" in the sentence "see that girl run", the network should be able to identify "girl" as a highly correlated word. For simplicity this example focuses on the word "that", but in reality all words receive this treatment in parallel and the resulting soft-weights and context vectors are stacked into matrices for further task-specific use.

The Q_w and K_w sub-networks of a single "attention head" calculate the soft weights, originating from the word "that". (Encoder-only QKV variant).

The sentence is sent through 3 parallel streams (left), which emerge at the end as the context vector (right). The word embedding size is 300 and the neuron count is 100 in each sub-network of the attention head.

The capital letter $X$ denotes a matrix sized 4 × 300, consisting of the embeddings of all four words.
The small underlined letter ${\underline {x}}$ denotes the embedding vector (sized 300) of the word "that".
The attention head includes three sub-networks (vertically arranged in the illustration), each having 100 neurons, with $W_{q}$ , $W_{k}$ and $W_{v}$ as their respective weight matrices, all of them sized 300 × 100.
$q$ (from "query") is a vector sized 100; $K$ ("key") and $V$ ("value") are 4 × 100 matrices.
The asterisk within parentheses " $(*)$ " denotes $\mathrm {softmax} (qK^{T}/{\sqrt {100}})$ , where $K=XW_{k}$ . The softmax result is a vector sized 4 that is later multiplied by the matrix $V=XW_{v}$ to obtain the context vector.
Rescaling by ${\sqrt {100}}$ prevents a high variance in $qK^{T}$ that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
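The effect of dividing the query-key dot products by √100 can be illustrated with random vectors. This is a sketch under assumptions: the query and keys are drawn from a standard normal distribution rather than produced by a trained network, and the seed is fixed only for repeatability.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
d = 100  # query/key size, matching the 100-neuron sub-networks
n = 4    # four words: "see that girl run"

# Hypothetical query for one word and keys for all four words.
q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]  # unscaled dot products
scaled = [s / math.sqrt(d) for s in raw]                 # divided by sqrt(100) = 10

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("largest weight without scaling:", round(max(w_raw), 3))
print("largest weight with scaling:   ", round(max(w_scaled), 3))
```

Dividing all scores by a constant greater than 1 can only flatten the softmax, so the largest weight with scaling is never greater than the largest weight without it.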

Notation: the commonly written row-wise $\mathrm {softmax}$ formula above assumes that vectors are rows, which differs from the standard mathematical convention of column vectors. With column vectors, the context vector is transposed and the softmax is taken column-wise, giving

${\textrm {Context}}=(XW_{v})^{T}\times \mathrm {softmax} \left((XW_{k})\times ({\underline {x}}W_{q})^{T}/{\sqrt {100}}\right)$

The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.^[24]

The structure of the input data is captured in the $W_{q}$ and $W_{k}$ weights, and the $W_{v}$ weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query ( $W_{q}$ ), Key ( $W_{k}$ ), and Value ( $W_{v}$ ), a loose and possibly misleading analogy with relational database systems.

Note that the context vector for "that" does not rely on context vectors for the other words; therefore the context vectors of all words can be calculated using the whole matrix $X$ , which includes all the word embeddings, instead of a single word's embedding vector $x$ in the formula above, thus parallelizing the calculations. Now, the softmax can be interpreted as a matrix softmax acting on separate rows. This is a huge advantage over recurrent networks which must operate sequentially.

The common query-key analogy with database queries suggests an asymmetric role for these vectors, where one item of interest (the query) is matched against all possible items (the keys). However, the parallel calculation matches all words of the sentence against one another; therefore the roles of these vectors are symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understanding attention further by studying its role in focused settings, such as in-context learning,^[25] masked language tasks,^[26] stripped down transformers,^[27] bigram statistics,^[28] pairwise convolutions,^[29] and arithmetic factoring.^[30]

A language translation example

To build a machine that translates English to French, an attention unit is grafted to the basic Encoder-Decoder (diagram below). In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of 3 trained, fully-connected neural network layers called query, key, and value.

A step-by-step sequence of a language translation.

Encoder-decoder with attention.^[31] The left part (black lines) is the encoder-decoder, the middle part (orange lines) is the attention unit, and the right part (in grey & colors) is the computed data. Grey regions in H matrix and w vector are zero values. Numerical subscripts indicate vector sizes while lettered subscripts i and i − 1 indicate time steps.

Legend
Label	Description
100	Max. sentence length
300	Embedding size (word dimension)
500	Length of hidden vector
9k, 10k	Dictionary size of input & output languages respectively.
X, Y	9k and 10k 1-hot dictionary vectors. X → x is implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear Decoder layer D; that is, it takes the argmax of D's linear layer output.
x	300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec.
h	500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it.
s	500-long decoder hidden state vector.
E	500-neuron recurrent neural network encoder with 500 outputs. Input count is 800: 300 from the source embedding plus 500 from recurrent connections. The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly.
D	2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary).^[32] The linear layer alone has 5 million (500 × 10k) weights – ~10 times more weights than the recurrent layer.
score	100-long alignment score
w	100-long attention weight vector. These are "soft" weights, which change during the forward pass, in contrast to "hard" neuronal weights, which change during the learning phase.
A	Attention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w.
H	500×100. 100 hidden vectors h concatenated into a matrix
c	500-long context vector = H * w. c is a linear combination of h vectors weighted by w.
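The shapes in the legend can be checked with a small sketch of c = H * w. The sentence length of 3 and the numeric entries are made up; the zero padding mirrors the zero-valued regions of H and w for sentences shorter than the maximum length of 100.

```python
HIDDEN = 500   # length of each hidden vector h
MAX_LEN = 100  # maximum sentence length


def context_vector(H, w):
    """c = H * w, where H is HIDDEN x MAX_LEN (one hidden vector per column)."""
    return [sum(H[r][j] * w[j] for j in range(len(w))) for r in range(len(H))]


# Hypothetical 3-word sentence: only the first 3 columns of H and the first
# 3 entries of w are non-zero; everything else is zero padding.
sentence_len = 3
H = [
    [float(j + 1) if j < sentence_len else 0.0 for j in range(MAX_LEN)]
    for _ in range(HIDDEN)
]
w = [0.7, 0.2, 0.1] + [0.0] * (MAX_LEN - sentence_len)

c = context_vector(H, w)
print(len(c), round(c[0], 3))
```

The result has 500 entries, one per hidden dimension, and the padded positions contribute nothing because their weights are zero.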

Viewed as a matrix, the attention weights show how the network adjusts its focus according to context.^[33]

	I	love	you
je	0.94	0.02	0.04
t'	0.11	0.01	0.88
aime	0.03	0.95	0.02

This view of the attention weights addresses the neural network "explainability" problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".
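The reading of the matrix described above amounts to taking, for each decoder step, the English word with the largest attention weight. A small sketch using the weights from the table:

```python
english = ["I", "love", "you"]
french = ["je", "t'", "aime"]

# Attention weights from the table: one row per French output word.
weights = [
    [0.94, 0.02, 0.04],  # je
    [0.11, 0.01, 0.88],  # t'
    [0.03, 0.95, 0.02],  # aime
]

alignment = {}
for fr, row in zip(french, weights):
    j = max(range(len(row)), key=row.__getitem__)  # index of the largest weight
    alignment[fr] = english[j]

print(alignment)  # {'je': 'I', "t'": 'you', 'aime': 'love'}
```

Taking the argmax discards the remaining weight in each row, so it is a summary of where attention is concentrated rather than a full account of what the network uses.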

Variants

Many variants of attention implement soft weights, such as

fast weight programmers, or fast weight controllers (1992)^[14]. A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention"^[34].
Bahdanau-style attention,^[33] also referred to as additive attention,
Luong-style attention,^[35] which is known as multiplicative attention,
highly parallelizable self-attention introduced in 2016 as decomposable attention^[23] and successfully used in transformers a year later,
positional attention and factorized positional attention.^[36]

For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention,^[37] channel attention,^[38] or combinations.^[39]^[40]

These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in the Core calculations section above.

1. encoder-decoder dot product	2. encoder-decoder QKV	3. encoder-only dot product	4. encoder-only QKV	5. Pytorch tutorial
Both encoder & decoder are needed to calculate attention.^[35]	Both encoder & decoder are needed to calculate attention.^[41]	Decoder is not used to calculate attention. With only 1 input into corr, W is an auto-correlation of dot products. w_ij = x_i x_j^[42]	Decoder is not used to calculate attention.^[43]	A fully-connected layer is used to calculate attention instead of dot product correlation.^[44]

Legend
Label	Description
Variables X, H, S, T	Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column.
S, T	S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2.
X, H	H, encoder hidden state; X, input word embeddings.
W	Attention coefficients
Qw, Kw, Vw, FC	Weight matrices for query, key, value respectively. FC is a fully-connected weight matrix.
⊕, ⊗	⊕, vector concatenation; ⊗, matrix multiplication.
corr	Column-wise softmax(matrix of all combinations of dot products). The dot products are x_i * x_j in variant 3, h_i * s_j in variant 1, column_i(Kw * H) * column_j(Qw * S) in variant 2, and column_i(Kw * X) * column_j(Qw * X) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by $\sqrt d$ , where $d$ is the height of the QKV matrices.

Mathematical representation

Standard Scaled Dot-Product Attention

${\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{T}}{\sqrt {d_{k}}}}\right)V$ where $Q,K,V$ are the query, key, and value matrices, $d_{k}$ is the dimension of the keys. Value vectors in matrix $V$ are weighted using the weights resulting from the softmax operation.

Multi-Head Attention

${\text{MultiHead}}(Q,K,V)={\text{Concat}}({\text{head}}_{1},...,{\text{head}}_{h})W^{O}$ where each head is computed as: ${\text{head}}_{i}={\text{Attention}}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$ and $W_{i}^{Q},W_{i}^{K},W_{i}^{V}$ , and $W^{O}$ are parameter matrices.

Bahdanau (Additive) Attention

${\text{Attention}}(Q,K,V)={\text{softmax}}(e)V$ where $e=\tanh(W_{Q}Q+W_{K}K)$ and $W_{Q}$ and $W_{K}$ are learnable weight matrices.^[33]

Luong Attention (General)

${\text{Attention}}(Q,K,V)={\text{softmax}}(QW_{a}K^{T})V$ where $W_{a}$ is a learnable weight matrix.^[35]
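The Luong (general) formula can be sketched as below, with plain lists as matrices and a row-wise softmax. The two choices of $W_{a}$ are hypothetical: the identity, which reduces the formula to unscaled dot-product attention, and a coordinate swap, which shows how the learned matrix changes which key a query matches.

```python
import math


def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def luong_general_attention(Q, K, V, W_a):
    """softmax(Q W_a K^T) V, with a row-wise softmax."""
    scores = matmul(matmul(Q, W_a), transpose(K))
    return matmul(softmax_rows(scores), V)


Q = [[1.0, 0.0]]                    # one query
K = [[1.0, 0.0], [0.0, 1.0]]        # two keys
V = [[10.0, 0.0], [0.0, 10.0]]      # two values
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]     # hypothetical W_a exchanging the coordinates

out_identity = luong_general_attention(Q, K, V, identity)
out_swap = luong_general_attention(Q, K, V, swap)
print(out_identity, out_swap)
```

With the identity the query matches the first key most strongly; with the swap it matches the second, so the two outputs are mirror images of each other.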

References

^ ^a ^b Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
^ ^a ^b Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
^ Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "1 Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN 978-0-19-530572-2.
^ Cherry EC (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears" (PDF). The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3. ISSN 0001-4966.
^ Broadbent, D (1958). Perception and Communication. London: Pergamon Press.
^ Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN 0042-6989.
^ Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985. doi:10.1364/AO.26.004985. ISSN 0003-6935.
^ Ba, Jimmy; Mnih, Volodymyr; Kavukcuoglu, Koray (2015-04-23), Multiple Object Recognition with Visual Attention, doi:10.48550/arXiv.1412.7755, retrieved 2024-08-06
^ Koch, Christof; Ullman, Shimon (1987), Vaina, Lucia M. (ed.), "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience, Dordrecht: Springer Netherlands, pp. 115–141, doi:10.1007/978-94-009-3833-5_5, ISBN 978-94-009-3833-5, retrieved 2024-08-06
^ Itti, L.; Koch, C.; Niebur, E. (November 1998). "A model of saliency-based visual attention for rapid scene analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (11): 1254–1259. doi:10.1109/34.730558.
^ Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972. doi:10.1364/AO.26.004972. ISSN 0003-6935.
^ Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
^ Rumelhart, David E.; Mcclelland, James L.; Group, PDP Research (1987-07-29). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF). Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.
^ ^a ^b Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
^ Ha, David; Dai, Andrew; Le, Quoc V. (2016-12-01), HyperNetworks, doi:10.48550/arXiv.1609.09106, retrieved 2024-08-06
^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].
^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". pp. 3156–3164.
^ Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhudinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2048–2057.
^ ^a ^b ^c Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19) [first version on arXiv was 1 Sep 2014], Neural Machine Translation by Jointly Learning to Align and Translate, doi:10.48550/arXiv.1409.0473, retrieved 2024-08-05
^ Parikh, Ankur; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016). "A Decomposable Attention Model for Natural Language Inference". Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.18653/v1/d16-1244.
^ Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. S2CID 205251479.
^ Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014-12-10), Neural Turing Machines, doi:10.48550/arXiv.1410.5401, retrieved 2024-08-06
^ ^a ^b Cheng, Jianpeng; Dong, Li; Lapata, Mirella (2016-09-20), Long Short-Term Memory-Networks for Machine Reading, doi:10.48550/arXiv.1601.06733, retrieved 2024-08-06
^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
^ Zhang, Ruiqi (2024). "Trained Transformers Learn Linear Models In-Context" (PDF). Journal of Machine Learning Research 1-55. 25. arXiv:2306.09927.
^ Rende, Riccardo (2024). "Mapping of attention mechanisms to a generalized Potts model". Physical Review Research. 6 (2): 023057. arXiv:2304.07235. Bibcode:2024PhRvR...6b3057R. doi:10.1103/PhysRevResearch.6.023057.
^ He, Bobby (2023). "Simplifying Transformers Blocks". arXiv:2311.01906 [cs.LG].
^ "Transformer Circuits". transformer-circuits.pub.
^ Transformer Neural Network Derived From Scratch. 2023. Event occurs at 05:30. Retrieved 2024-04-07.
^ Charton, François (2023). "Learning the Greatest Common Divisor: Explaining Transformer Predictions". arXiv:2308.15594 [cs.LG].
^ Britz, Denny; Goldie, Anna; Luong, Minh-Thang; Le, Quoc (2017-03-21). "Massive Exploration of Neural Machine Translation Architectures". arXiv:1703.03906 [cs.CV].
^ "Pytorch.org seq2seq tutorial". Retrieved December 2, 2021.
^ ^a ^b ^c Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". Proceedings of the 38th International Conference on Machine Learning. PMLR. pp. 9355–9366.
^ ^a ^b ^c Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-Based Neural Machine Translation". arXiv:1508.04025v5 [cs.CL].
^ "Learning Positional Attention for Sequential Recommendation". catalyzex.com.
^ Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. arXiv:1904.05873. doi:10.1109/ICCV.2019.00679. ISBN 978-1-7281-4803-8. S2CID 118673006.
^ Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. arXiv:1709.01507. doi:10.1109/CVPR.2018.00745. ISBN 978-1-5386-6420-9. S2CID 206597034.
^ Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv:1807.06521 [cs.CV].
^ Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv:2204.04218 [eess.IV].
^ Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
^ Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
^ Alfredo Canziani & Yann LeCun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
^ Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". pytorch.org. Retrieved 2021-12-22.

External links

Dan Jurafsky and James H. Martin (2022) Speech and Language Processing (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube
