Transformer Architecture

The transformer is a deep learning architecture responsible for the recent dramatic advances in natural language processing and other forms of AI.

What is it about the structure of large language models that has contributed to the incredible advances in AI?


Key Points:

  • The transformer architecture marked a turning point in the development of large language models and the architecture still forms the basis of state-of-the-art models.
  • Transformers broke new ground by exclusively employing a distinctive “self-attention” mechanism as a way of capturing relationships between words in a text.
  • Understanding precisely why the transformer architecture is so powerful remains an active area of AI research, with new methods being developed to investigate these revolutionary systems.

The transformer is a neural network architecture, first introduced in 2017, that lies at the heart of possibly all major large language models in use today. Although the architecture is not used exclusively to construct language models, understanding in broad terms how transformers process linguistic data is important for understanding how the recent AI boom in language model technology has come about.

A language model, broadly speaking, is a computational system that assigns probabilities to words on the basis of some input text, where these probabilities concern how likely it is that a given word appears at a particular point in the text. So a language model that works well might tell you that “the cat is on the” is very likely to be followed by “mat”. A very simple language model will make this prediction from very little information: a bigram model, for example, predicts the next word solely on the basis of the previous word. To get better at this task, it helps to introduce sophisticated representations of each word that capture intricate details about where it tends to turn up in relation to other words. It also helps to take into account more than just the previous word, increasing the “context window” to take in much or all of the previous text. Neural network language models prior to the invention of the transformer achieved both of these improvements. But an ideal language model would also be sensitive to the relationships that hold between words. After all, to predict what comes next in “The leader of the opposition in the UK is”, we would need to grasp the particular relationships between “leader”, “opposition”, and “UK” in that text.
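To make the bigram idea concrete, here is a minimal sketch in Python of a bigram model estimated from a toy corpus; the corpus and the probabilities it produces are purely illustrative and are not drawn from any particular model.

```python
# A minimal bigram language model: predict the next word purely from the
# previous word, using counts gathered from a toy corpus.
from collections import Counter, defaultdict

corpus = "the cat is on the mat . the cat is on the sofa .".split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev_word):
    """Probability of each possible next word, given only the previous word."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'sofa': 0.25}
```

A transformer-based language model performs the same prediction task, but conditions on a far richer representation of the whole preceding context.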

One aspect of understanding the success of the transformer architecture is understanding how it captures word relations as well as it does. The key lies in what are called self-attention heads. The task of a self-attention head is to transform the way a given word is represented so that the representation becomes sensitive to the words that surround it. It does this in the following way. Words are represented as vectors, i.e. as ordered lists of real numbers. As such, they can be transformed through arithmetical operations such as addition and subtraction. To generate a new vector that takes the surrounding words into account, we could simply calculate a weighted sum of all the word-vectors. For example, to produce a vector for “mat” that captures the fact that it turned up after “the cat is on the”, we take the vectors for all of those words and combine them in a weighted sum. But what determines the weightings? This is where self-attention plays its role. Each self-attention head has a way of assigning a score to every other word based on its relationship to the target word (here, “mat”). For instance, imagine a self-attention head that is particularly focused on determiner phrases (such as “the mat”). This head might assign a high attention score to the second “the” in our earlier example, but a lower attention score to “cat”. These scores then feed into the weighted sum and allow some words to have much greater influence on the resultant vector than others.
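The following sketch, using NumPy, illustrates this weighted-sum idea for a single self-attention head. The word vectors and projection matrices here are random toy values standing in for what a real model would learn, and the softmax normalisation of the scores is the standard choice rather than something discussed above.

```python
# A single self-attention head, sketched with toy 4-dimensional word vectors.
# All numbers are random stand-ins for values a real model would learn.
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "cat", "is", "on", "the", "mat"]
d = 4
X = rng.normal(size=(len(words), d))      # one row vector per word

# Learned projections (random here) map each word vector to a query, key, value.
W_q, W_k, W_v = rng.normal(size=(3, d, d))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention scores: how relevant each word is to each other word.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax

# Each word's new vector is a weighted sum of the value vectors of all words.
new_vectors = weights @ V
print(weights[-1].round(2))      # how strongly "mat" attends to each word
print(new_vectors[-1].round(2))  # the new, context-sensitive vector for "mat"
```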

Throughout this procedure, the vectors are run through a number of distinct transformations, and how these transformations work is determined entirely during the model’s training phase. As such, training determines what each self-attention head focuses on and how it transforms the vectors it receives.
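As a sketch of what being “determined during training” amounts to: the transformations inside a head are just trainable parameters. In the PyTorch snippet below, the projection matrices start out random and would be adjusted by an optimizer during training; the dimensions and the choice of Adam as optimizer are arbitrary illustrative assumptions.

```python
# The transformations inside an attention head are ordinary trainable parameters.
import torch
import torch.nn as nn

d = 4
head = nn.ModuleDict({
    "query": nn.Linear(d, d, bias=False),
    "key":   nn.Linear(d, d, bias=False),
    "value": nn.Linear(d, d, bias=False),
})

# Before training, these projections are random; gradient descent (here, Adam)
# adjusts them, and in doing so fixes what the head attends to.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
print(sum(p.numel() for p in head.parameters()), "trainable parameters")  # 48
```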

In any transformer model there are many layers, each containing many self-attention heads, so this procedure is repeated many times. Since each self-attention head is able to focus on just certain aspects of the input text, the model as a whole can become sensitive to a wide array of linguistic features, including syntactic, semantic, and pragmatic features, and much more besides.
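A sketch using PyTorch’s built-in modules shows what this stacking looks like in code; the model width, number of heads, and number of layers below are arbitrary toy choices, not the settings of any particular model.

```python
# Stacking layers of multi-head self-attention with PyTorch's built-in encoder.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 8, 6   # toy sizes, chosen arbitrarily
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

# One "sentence" of 10 toy word vectors; every layer lets 8 heads attend to
# the whole sequence, so the attention procedure is repeated 6 x 8 times.
x = torch.randn(1, 10, d_model)
out = encoder(x)
print(out.shape)   # torch.Size([1, 10, 64])
```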

Transformer models suffer from the same kind of opacity as all large neural networks: we cannot simply read off, from the potentially billions of parameters that constitute a model, exactly how it processes any particular input. That said, as the inventors of the architecture noted, the layers of self-attention heads within transformers make for a discernible structure that may be suitable for interpretation (Vaswani et al. 2017). For example, Manning et al. (2020) identified specific self-attention heads within the language model BERT that, when processing a noun that is the direct object of a verb, will assign a high attention score to the verb (a simple sketch of this kind of attention inspection appears below). Producing new ways of interpreting how transformer models process particular inputs is currently a thriving area of research within artificial intelligence. But in investigating this, a host of further important questions are raised. Can transformer models provide new insight into linguistic phenomena? Could they even provide insight into human cognition? And could transformer models themselves be capable of communication or cognition? However such questions are answered, the coming years promise to be an exciting time of theoretical, empirical, and engineering progress within artificial intelligence.
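As a hedged illustration of the kind of inspection such interpretability work involves, the sketch below pulls out the attention weights BERT assigns while processing a sentence, using the Hugging Face transformers library. The sentence, layer, and head indexed here are arbitrary choices for illustration; finding a head that genuinely tracks direct objects requires the kind of systematic analysis Manning et al. (2020) carried out.

```python
# Inspecting the attention weights of a pretrained BERT model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat chased the mouse", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each shaped
# (batch, num_heads, sequence_length, sequence_length).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 5, 3   # arbitrary indices, purely for illustration
print(tokens)
print(outputs.attentions[layer][0, head])
```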