ELMo
ELMo (Embeddings from Language Models) is a word embedding method for representing a sequence of words as a corresponding sequence of vectors.[1] It was created by researchers at the Allen Institute for Artificial Intelligence[2] and the University of Washington and first released in February 2018. It is a bidirectional LSTM that takes character-level tokens as inputs and produces word-level embeddings.
Architecture
ELMo is a multilayer bidirectional LSTM stacked on top of a token embedding layer. The full embedding of a token is formed by concatenating its token embedding with the hidden states of every LSTM layer at that position. As the full embedding is too large for most tasks, it is typically mapped through a trainable linear map (a "projection matrix") to produce the task-specific embedding.
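The following is a minimal PyTorch sketch of this architecture, not the original implementation: it assumes a word-level embedding table in place of ELMo's character-level convolutional encoder, and the class name, dimensions, and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ElmoStyleEncoder(nn.Module):
    """Simplified ELMo-style encoder: stacked biLSTMs over token embeddings."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 num_layers=2, task_dim=256):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # Stack of bidirectional LSTM layers; each layer's output is kept.
        self.lstms = nn.ModuleList([
            nn.LSTM(embed_dim if i == 0 else 2 * hidden_dim,
                    hidden_dim, batch_first=True, bidirectional=True)
            for i in range(num_layers)
        ])
        # Full embedding = token embedding plus every layer's hidden states;
        # it is large, so a trainable projection maps it down.
        full_dim = embed_dim + num_layers * 2 * hidden_dim
        self.projection = nn.Linear(full_dim, task_dim)

    def forward(self, token_ids):
        x = self.token_embedding(token_ids)      # (batch, seq, embed_dim)
        layer_outputs = [x]
        h = x
        for lstm in self.lstms:
            h, _ = lstm(h)                       # (batch, seq, 2*hidden_dim)
            layer_outputs.append(h)
        full = torch.cat(layer_outputs, dim=-1)  # concatenate all layers
        return self.projection(full)             # task-specific embedding
```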
After the ELMo model is trained, its weights are frozen, and only the projection matrix is trained to minimize loss on a specific language task. This is an early example of the pretraining paradigm: a large model is trained once on unlabeled text, then reused across downstream tasks.
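A sketch of this pretrain-then-freeze recipe, reusing the hypothetical `ElmoStyleEncoder` from the previous sketch:

```python
import torch

encoder = ElmoStyleEncoder(vocab_size=10000)  # assume already pretrained

for p in encoder.parameters():
    p.requires_grad = False              # freeze the pretrained weights...
for p in encoder.projection.parameters():
    p.requires_grad = True               # ...except the projection matrix

# Only the projection is optimized against the downstream task's loss.
optimizer = torch.optim.Adam(encoder.projection.parameters(), lr=1e-3)
```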
Comparison
Like BERT (but unlike the word embeddings produced by "bag of words" approaches, and earlier vector approaches such as Word2Vec and GloVe), ELMo embeddings are context-sensitive, producing different representations for words that share the same spelling but have different meanings (homonyms) such as "bank" in "river bank" and "bank balance".[3]
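This context sensitivity can be observed directly. The sketch below assumes the AllenNLP library's `Elmo` module and pretrained model files downloaded locally; the two file paths are placeholders, not canonical names. The two occurrences of "bank" receive distinct vectors.

```python
import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths to locally downloaded pretrained ELMo files.
elmo = Elmo("elmo_options.json", "elmo_weights.hdf5",
            num_output_representations=1)

sentences = [["the", "river", "bank", "was", "muddy"],
             ["my", "bank", "balance", "is", "low"]]
character_ids = batch_to_ids(sentences)            # character-level inputs
embeddings = elmo(character_ids)["elmo_representations"][0]

bank_river = embeddings[0, 2]                      # "bank" in sentence 1
bank_money = embeddings[1, 1]                      # "bank" in sentence 2
similarity = torch.cosine_similarity(bank_river, bank_money, dim=0)
print(similarity)  # well below 1.0: different contexts, different vectors
```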
ELMo's key innovation is its use of a bidirectional language model. Unlike its unidirectional predecessors, it processes text in both the forward and backward directions, so each word's representation draws on its entire surrounding context. This allows ELMo to encode nuances of meaning that unidirectional models miss.[4]
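A minimal sketch of the bidirectional language-model objective behind this, with illustrative shapes and names: a forward LSTM is trained to predict each next token while a backward LSTM predicts each previous token, so every position is modeled from both sides.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(10000, 512)
fwd = nn.LSTM(512, 512, batch_first=True)   # reads left to right
bwd = nn.LSTM(512, 512, batch_first=True)   # reads right to left
to_vocab = nn.Linear(512, 10000)            # shared softmax layer

tokens = torch.randint(0, 10000, (1, 7))    # one sentence of 7 token ids
x = embed(tokens)

h_fwd, _ = fwd(x)                            # left-to-right pass
h_bwd, _ = bwd(torch.flip(x, dims=[1]))      # right-to-left pass
h_bwd = torch.flip(h_bwd, dims=[1])          # realign to original order

# Forward states predict the next token; backward states the previous one.
loss = nn.functional.cross_entropy(
    to_vocab(h_fwd[:, :-1]).reshape(-1, 10000), tokens[:, 1:].reshape(-1)
) + nn.functional.cross_entropy(
    to_vocab(h_bwd[:, 1:]).reshape(-1, 10000), tokens[:, :-1].reshape(-1)
)
```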
References
- ^ Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018). "Deep contextualized word representations". arXiv:1802.05365 [cs.CL].
- ^ "AllenNLP - ELMo — Allen Institute for AI".
- ^ "How to use ELMo Embedding in Bidirectional LSTM model architecture?". www.insofe.edu.in. 2020-02-11. Retrieved 2023-04-04.
- ^ Van Otten, Neri (26 December 2023). "Embeddings from Language Models (ELMo): Contextual Embeddings A Powerful Shift In NLP".