5.2 Pretrained Models

Basic Overview

Training transformers from scratch is expensive: it takes large amounts of time and computing power, and the process often has to be repeated many times to reach good accuracy. To save time and energy, we turn to pretrained models - models that have already been built and trained by researchers on huge amounts of data, meaning that our model starts out accurate and powerful, and we only need to add some finishing touches to fit it to our needs. These models also usually come with trained word embeddings, so we don’t need to train a word embedding model on our data either.
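As a minimal sketch of what "pretrained" means in practice, the Hugging Face `transformers` library (assumed installed, along with PyTorch) can download a ready-made checkpoint such as `bert-base-uncased`, which ships with its learned word embeddings:

```python
# A minimal sketch using the Hugging Face transformers library
# (pip install transformers torch). "bert-base-uncased" is one
# example of a freely downloadable pretrained checkpoint.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word embeddings come with the model: one vector per
# vocabulary entry, already trained on large text corpora.
embeddings = model.get_input_embeddings()
print(embeddings.weight.shape)  # vocabulary size x embedding dimension
```

No training happens here at all - the weights are simply downloaded, which is exactly the time and energy savings described above.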

Examples of Pretrained Models

There are many popular examples of pretrained models, some of which have been making the rounds in machine learning news. Some of these examples include:


BERT introduces bidirectionality, which means that it reads text from left to right and from right to left at the same time. This allows it to do two things: 1) masked language modeling, and 2) next sentence prediction. Masked language modeling is where you cover up a word in a sentence and the transformer has to predict the missing word from the context. Bidirectionality is useful here because the model sees the words on both sides of the mask, giving it the full context; a model that only reads left to right would only see the words to the left of the mask, so it would be missing half the context. Next sentence prediction isn’t exactly what you may think - it predicts whether two sentences have a logical connection and could plausibly come one after the other.
BERT is special because its approach to language is very “human-like”, helping it with natural language understanding (NLU). This allows it to tackle one of the biggest problems in Natural Language Processing: reading comprehension.


XLNet is essentially a larger and more powerful successor to BERT, which makes it well suited for tasks like question answering and sentiment analysis, in which a model determines whether a piece of text expresses a happy, sad, angry, or other sentiment.
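Sentiment analysis is easy to try with the `transformers` pipeline API; note that the default checkpoint it downloads here is a distilled BERT variant rather than XLNet itself, but the task is the same:

```python
# A small sentiment-analysis sketch using the transformers pipeline
# API (pip install transformers torch). The default checkpoint is a
# distilled BERT variant, not XLNet, but the task is identical.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I loved this movie, it was fantastic!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```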


Another upgraded BERT, ALBERT, is much smaller than BERT but just as accurate, if not more so. ALBERT’s expertise lies in reading comprehension, specifically the RACE benchmark, which is similar to SAT-style reading comprehension tests.
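Reading comprehension of the extractive kind can be sketched with the question-answering pipeline; the default checkpoint it downloads is a distilled BERT fine-tuned on SQuAD rather than ALBERT, but it illustrates the same skill - reading a passage and pulling out the answer:

```python
# Extractive question answering via the transformers pipeline API
# (pip install transformers torch). The default checkpoint is a
# distilled BERT fine-tuned on SQuAD, used here for illustration.
from transformers import pipeline

qa = pipeline("question-answering")
answer = qa(
    question="Who trains pretrained models?",
    context="Pretrained models are trained by researchers on large "
            "datasets and then shared so that others can fine-tune them.",
)
# The model returns the span of the passage it believes answers
# the question, along with a confidence score.
print(answer["answer"], answer["score"])
```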


RoBERTa is Facebook’s take on BERT - it was trained using the same masking strategy, but key improvements (more data, longer training, and better-tuned hyperparameters) make it better suited for many language tasks. Unlike BERT, it was also trained on news articles, which can give it a more informational writing style.

Honorable Mention: GPT-3

GPT-3 is a model created by OpenAI that was trained on an enormous portion of the text on the internet, meaning that it has a massive knowledge base and is amazing at some very specialized tasks.
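GPT-3 itself is only available through OpenAI’s API, but its smaller open-source predecessor GPT-2 is freely downloadable and uses the same text-generation interface, so it works as a stand-in sketch:

```python
# GPT-3 is served only through OpenAI's API, so this sketch uses its
# smaller open-source predecessor GPT-2 (pip install transformers torch),
# which generates text the same way: predicting one token at a time.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator("Machine learning is", max_new_tokens=15, num_return_sequences=1)
print(output[0]["generated_text"])  # the prompt plus a model-written continuation
```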

Where can we find pretrained models?

In the AI industry, we use a website called Hugging Face to find most of our models. Hugging Face has a large library of transformers and datasets. For this course, and most likely for any other transformer model building you do, we will use Hugging Face to find our pretrained models. We use Hugging Face because of its massive model base - we can use it to find virtually any model trained on any dataset.
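Any model on the Hugging Face Hub can be loaded by its identifier string; `roberta-base` is used below as one example, and the `Auto*` classes pick the right architecture for whatever checkpoint you name:

```python
# Loading a model from the Hugging Face Hub by its identifier string
# (pip install transformers torch). AutoTokenizer and AutoModel pick
# the right classes for the named checkpoint automatically.
from transformers import AutoModel, AutoTokenizer

model_name = "roberta-base"  # swap in any Hub model identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Hugging Face hosts thousands of models.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```

Changing `model_name` to any other Hub identifier is all it takes to try a different pretrained model.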


Copyright © 2021 Code 4 Tomorrow. All rights reserved. The code in this course is licensed under the MIT License. If you would like to use content from any of our courses, you must obtain our explicit written permission and provide credit. Please contact classes@code4tomorrow.org for inquiries.