Transformers are a relatively recent development in machine learning, originally designed for natural language processing but now applied to all sorts of tasks, including computer vision. They were introduced in 2017 in the paper Attention Is All You Need. Instead of mimicking the human brain, a transformer is built around a concept called attention, which, in short, weighs how much each word in a sentence contributes to its meaning. In other words, it prioritizes the most meaningful words in a sentence over those that provide less insight into the sentence's meaning. In this chapter, we will cover the basic concepts behind transformers, how to use TensorFlow and Hugging Face to implement transformers in our code, and finally how to use Vision Transformers (ViTs) for computer vision.
The idea of attention is what separates transformers from other models like LSTMs and RNNs. The idea is that different words in a sentence provide more insight into its meaning than others. Another way to think about attention is with this example: when you read this sentence, you read each word from left to right, but you remember the keywords, and those keywords help you understand the sentence as a whole. Similarly, attention allows a transformer to remember the keywords in a sentence and, as a consequence, understand the meaning of the entire sentence.
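To make this less abstract, here is a minimal sketch of the scaled dot-product attention operation from Attention Is All You Need, written in plain NumPy. The four "words" and their 3-dimensional vectors are made-up toy data; real models use learned embeddings with hundreds of dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V have shape (sequence_length, d_k).
    Returns the attended values and the attention weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how strongly each word relates to each other word
    weights = softmax(scores, axis=-1)  # each row is a distribution over the sentence
    return weights @ V, weights

# Toy example: a "sentence" of 4 words, each a random 3-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
out, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(weights.round(2))  # row i shows how much word i attends to every word
```

Each row of `weights` sums to 1, so you can read it as the "importance" that one word assigns to every word in the sentence, which is exactly the keyword-remembering behavior described above.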
Now you might be asking - how does the transformer decide which words are keywords? That comes from training. Most transformers we will use are pretrained, meaning that they already have an understanding of the relationships between words, and of how those relationships affect a sentence's meaning, before we even start to train them. Transformers also identify keywords through word embeddings - if we apply word embeddings to our dataset, the transformer can understand the relationships between the different words in a sentence, and hence pick out the keywords.
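The idea behind word embeddings is that related words get vectors that point in similar directions. The sketch below uses hand-made 4-dimensional vectors purely for illustration; learned embeddings are much higher-dimensional and come from training, not from hand-tuning.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings: "king" and "queen" share most of their direction,
# while "apple" points somewhere else entirely.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```

Because relationships like this are baked into the vectors, a model that reads embeddings instead of raw characters already starts with a rough map of which words belong together.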
Transformers can be used for almost any natural language processing task, notably translation, question answering, text classification, and text generation. Translation and question answering are pretty self-explanatory - text classification and generation, however, might be a little confusing. Text classification is the process of sorting text into categories - examples include hate speech detection and sentiment analysis, which classifies what emotion a piece of text conveys: hate, love, happiness, and so on. Text generation is creating new sentences, or even essays, from some input - these models can produce books and stories in the style of a variety of authors or genres.
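With the Hugging Face `transformers` library, trying these tasks takes only a few lines. A minimal sketch, assuming the `transformers` package is installed; when no model name is given, `pipeline` picks a default pretrained model and downloads it on first use, so a network connection is needed.

```python
from transformers import pipeline

# Text classification: sentiment analysis with a default pretrained model.
classifier = pipeline("sentiment-analysis")
print(classifier("I loved this chapter!"))   # e.g. a POSITIVE label with a confidence score

# Text generation: continue a prompt with newly generated text.
generator = pipeline("text-generation")
print(generator("Once upon a time,", max_length=30))
```

Other task names such as `"translation_en_to_fr"` and `"question-answering"` work the same way, which is why pipelines are a convenient first stop before writing any custom training code.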
Transformers have also become popular for computer vision since the paper An Image Is Worth 16x16 Words, which demonstrated a use case for transformers in this area. You can read a little about them in the Towards Data Science article "What Are Vision Transformers And How Are They Important For General Purpose Learning?" by J. Rafid Siddiqui, PhD. We will learn more about them in a future chapter.
As with most neural networks, we will need to classify or create something after our model has trained - this can be predicting the next word in a sentence or translating entire sentences into a different language. In most transformer architecture diagrams, you will see a small block labeled MLP, or Multilayer Perceptron. This MLP is essentially a regular Dense (fully connected) neural network that turns the transformer's outputs into a final prediction. This comes in handy when we are dealing with pretrained transformers - by training just the MLP layer, we can take advantage of the hours of training already done by researchers and fit the transformer to our needs.

5.2 Pretrained Models