Hi, my name is Fazal, and I am a freshman at UC Berkeley. The following document is a collection of what I have learned about transformer models. Follow along as I explain this novel architecture and build my own transformer from scratch to really understand its inner workings. This document is meant for people of all levels of expertise and should only require a curious, thoughtful mind.
With the rise of AI like ChatGPT and Google’s Bard, the race to build the world’s first fully sentient AI is underway. To train a ChatGPT-level chatbot, we have to take a system from zero knowledge of a language to one that is “smart” enough to convey ideas thoughtfully. Several major companies are trying their luck at constructing and training their own models with unique datasets, parameters, frameworks, etc. While each of these models is constructed differently, they all share the same underlying architecture that allows them to work so well. This novel architecture is known as the transformer.
Transformers are a kind of machine learning model used for sequence-to-sequence modeling. The simplest explanation of a transformer is a model that takes an input sequence and generates an output sequence. Some examples include language translation and summarization. In both cases, the model expects a collection of words (the input sequence) and outputs a new collection of words (the output sequence).
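To make the "sequence in, sequence out" idea concrete, here is a tiny sketch of that interface. The function below is just a stand-in, not a real model: its body is invented for illustration, and a real transformer would replace it with learned computation.

```python
from typing import List

# A transformer, viewed as a black box: a function from one sequence
# of words to another sequence of words. The body here is a fake
# stand-in (it just truncates the input) purely to show the interface.
def seq2seq_model(input_sequence: List[str]) -> List[str]:
    return input_sequence[:3]

print(seq2seq_model("the quick brown fox jumps".split()))
# → ['the', 'quick', 'brown']
```

Whether the task is translation or summarization, only this interface stays the same; what changes is what the model has learned to do between input and output.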
With so much innovation happening on this front, I found myself wondering how these transformers actually work. How can I ask these systems a question about any random topic and get an instantaneous response that is not only accurate, but also given to me in a conversational manner, as if I were talking to another person? Let’s dive in.
In the remainder of this document, I will be referring to a transformer used for text-to-text generation.
In our daily lives, whether we are talking to a friend, solving a math problem, or even microwaving something, we often use our past experiences or memories to help us make informed decisions in the present. We do this without even thinking for the most part; even in linear algebra, we still remember how to factor because we learned it years ago. When deciding what to eat for dinner, we take into consideration that just a week ago we already had Chinese food. While this idea of extracting relevant thoughts from the past and using them to make judgments is second nature for us, it is not so easy for an AI. How can we tell our model which “memories” it should be using, and how relevant they even are? Transformers aim to solve exactly this problem, which is why they are so good at tasks like question answering and summarization: they are able to actually understand and learn from context.
The idea I just described is known as attention. In essence, attention allows the model to focus on specific parts of its input sequence when generating the output sequence, which significantly enhances its ability to understand context and generate accurate results.
To understand this, let’s use the example of a transformer whose job is to predict the next word in a sentence. Say the input sentence is “I love the colors red and ” and we would like the output to be just “blue.” Attention allows the model to look at the sentence it was given and find links between different words. For example, our transformer may learn that “red” is a “color” and that “and” means the output should include another “color” to complete the thought. While the actual computations are a bit more complex, this simple explanation captures how a transformer learns and logically deduces what the next word may be.
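Here is a toy numerical sketch of this idea. The scores below are hand-picked, hypothetical numbers (not from any real trained model); they illustrate how a model might "focus" more on "colors" and "red" than on the other words when predicting what comes next.

```python
import math

# Hypothetical attention scores over the input words. These numbers
# are invented for illustration only: higher means the model focuses
# on that word more when predicting the next word.
words = ["I", "love", "the", "colors", "red", "and"]
scores = [0.1, 0.2, 0.1, 2.0, 3.0, 1.5]

# Softmax turns raw scores into weights that are positive and sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

for word, weight in zip(words, weights):
    print(f"{word:>7}: {weight:.2f}")
```

Notice that after the softmax, "red" and "colors" carry most of the weight, which is exactly the behavior we hope a trained model learns: attend to the words that determine the answer.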
With this example, you may realize how powerful attention can be, especially when given a much longer input sequence. If your transformer receives a paragraph as input, it is able to create more links between words, effectively increasing its understanding of the overall input. This concept is specifically known as self-attention, as it involves the transformer looking solely at its own input sequence.
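For the curious, self-attention can be sketched in a few lines of NumPy. This is a deliberately bare-bones version: real transformers apply learned projections (the standard "queries, keys, and values") to the embeddings first, while here every token simply scores its similarity against every other token in the same sequence, which is the "self" part.

```python
import numpy as np

np.random.seed(0)

# Toy embeddings: 6 input tokens, each a 4-dimensional vector.
# In a real model these would come from a learned embedding layer.
x = np.random.randn(6, 4)

# Each token scores every token in the SAME sequence (hence
# "self"-attention). Dividing by sqrt(dimension) is the standard
# scaling used to keep scores in a reasonable range.
scores = x @ x.T / np.sqrt(x.shape[1])          # (6, 6) similarity scores

# Softmax each row so every token's weights over the sequence sum to 1.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# Each output token is a weighted mix of all input tokens.
output = weights @ x

print(weights.shape, output.shape)  # (6, 6) (6, 4)
```

The key takeaway is the (6, 6) weight matrix: row i tells us how much token i "looks at" each of the 6 tokens, which is precisely the web of links between words described above.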
This image, which shows the inner workings of a transformer, is by far the most straightforward way to understand how the model works, at least once you know what all the components actually do. At first glance, we can see that this diagram contains two separate blocks: the one on the left is the encoder and the one on the right is the decoder.
In this document, I will focus mostly on the decoder and touch only briefly on the encoder. The reason is that transformers don’t always need an encoder; in fact, decoder-only transformers excel at generating coherent sequences based on contextual information. Many state-of-the-art models, such as OpenAI’s GPT-4, happen to be decoder-only transformers.
Let’s unpack this diagram and understand each individual piece, one at a time. Remember that all my examples and explanations are in the context of a transformer that uses sentences as input/output sequences. In other tutorials or transformer explanations, you may see the word “token.” For our purposes, “tokens” and “words” are interchangeable: they are just the parts that make up the sequences that go in and out of the model.
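Since we are treating words and tokens as interchangeable, the simplest possible "tokenizer" is just splitting on whitespace. Real models actually use subword tokenizers (such as byte-pair encoding), which can break a single word into several tokens, but the word-level view below is enough for this document.

```python
# A naive word-level "tokenizer": split the sentence on whitespace so
# that each word becomes one token. Production models use subword
# tokenizers instead, but this captures the idea for our purposes.
sentence = "I love the colors red and"
tokens = sentence.split()
print(tokens)  # → ['I', 'love', 'the', 'colors', 'red', 'and']
```

Whenever "token" appears later, you can mentally picture one entry of a list like this.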