Section 1 - Modeling
We started this course by analyzing a language model as a black box:
Then we looked at the training data of large language models (e.g., The Pile):
In this lecture, we will open up the onion all the way and talk about how large language models are built.
Today’s lecture will focus on two topics, tokenization and model architecture.
- Tokenization: how a string is split into tokens.
- Model architecture: we will mostly discuss the Transformer architecture, which is the modeling innovation that really enabled large language models.
Tokenization
Recall that a language model $p$ is a probability distribution over sequences of tokens $x_{1:L}$, where each token comes from a vocabulary $V$ (e.g., [the, mouse, ate, the, cheese]).
However, natural language doesn’t come as a sequence of tokens, but as just a string (concretely, sequence of Unicode characters):
A tokenizer converts any string into a sequence of tokens.
This is not necessarily the most glamorous part of language modeling, but it plays a crucial role in determining how well a model will work.
Split by spaces
The simplest solution is to do:
text.split(' ')
- This doesn’t work for languages such as Chinese, where sentences are written without spaces between words:
我今天去了商店。 [gloss: I went to the store.]
- Then there are languages like German that have long compound words (e.g., Abwasserbehandlungsanlage).
- Even in English, there are hyphenated words (e.g., father-in-law) and contractions (e.g., don’t), which should get split up. For example, the Penn Treebank splits don’t into do and n’t, a linguistically informed but not obvious choice.
Therefore, splitting by spaces to identify words is quite problematic.
What makes a good tokenization?
- We don’t want too many tokens (extreme: characters or bytes), or else the sequence becomes difficult to model.
- We don’t want too few tokens, or else there won’t be parameter sharing between words (e.g., should mother-in-law and father-in-law be completely different?). This is especially problematic for morphologically rich languages (e.g., Arabic, Turkish, etc.).
- Each token should be a linguistically or statistically meaningful unit.
Byte pair encoding
Sennrich et al., 2015 applied the byte pair encoding (BPE) algorithm, originally developed for data compression, to produce one of the most commonly used tokenizers.
Learning the tokenizer. Intuition: start with each character as its own token and combine tokens that co-occur a lot.
- Input: a training corpus (sequence of characters).
- Initialize the vocabulary $V$ to be the set of characters.
- While we still want to grow $V$:
  - Find the pair of elements $x, x' \in V$ that co-occur the most number of times.
  - Replace all occurrences of $x, x'$ with a new symbol $xx'$.
  - Add $xx'$ to $V$.
Example:
- [t, h, e, ␣, c, a, r], [t, h, e, ␣, c, a, t], [t, h, e, ␣, r, a, t]
- [th, e, ␣, c, a, r], [th, e, ␣, c, a, t], [th, e, ␣, r, a, t] (th occurs 3x)
- [the, ␣, c, a, r], [the, ␣, c, a, t], [the, ␣, r, a, t] (the occurs 3x)
- [the, ␣, ca, r], [the, ␣, ca, t], [the, ␣, r, a, t] (ca occurs 2x)
The output of learning is:
- Updated vocabulary $V$: [a, c, e, h, t, r, ca, th, the]
- The merges that we made (important for applying the tokenizer):
  - t, h ⇒ th
  - th, e ⇒ the
  - c, a ⇒ ca
Applying the tokenizer. To tokenize a new string, apply the merges in the same order:
- [t, h, e, ␣, o, x]
- [th, e, ␣, o, x]
- [the, ␣, o, x]
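The learning and application steps above fit in a few lines of Python. Below is a minimal sketch (the names learn_bpe, apply_bpe, and merge_pair are just for illustration; real byte-pair tokenizers such as GPT-2's add byte-level fallback and pre-tokenization rules, and do not merge across word boundaries, which this sketch mimics by skipping pairs that contain a space):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a corpus given as a list of strings."""
    # Start with each character as its own token.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens co-occurs.
        pair_counts = Counter()
        for seq in sequences:
            for pair in zip(seq, seq[1:]):
                # Do not merge across word boundaries (as in Sennrich et al.).
                if " " in pair[0] or " " in pair[1]:
                    continue
                pair_counts[pair] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent pair
        merges.append(best)
        # Replace all occurrences of the pair with the merged symbol.
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def apply_bpe(text, merges):
    """Tokenize a new string by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

merges = learn_bpe(["the car", "the cat", "the rat"], num_merges=3)
print(merges)                         # [('t', 'h'), ('th', 'e'), ('c', 'a')]
print(apply_bpe("the ox", merges))    # ['the', ' ', 'o', 'x']
```

Running it on the three-sentence corpus above recovers the merges t, h ⇒ th; th, e ⇒ the; c, a ⇒ ca, and tokenizes "the ox" as [the, ␣, o, x], matching the worked example.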
Unicode.
- One problem is that (especially in the multilingual setting), there are a lot (144,697) of Unicode characters.
- We certainly will not see all characters in the training data.
- In order to reduce data sparsity even further, we can run BPE on bytes instead of Unicode characters (Wang et al. 2019).
- Example in Chinese:
  今天 [gloss: today] is represented in UTF-8 as the bytes [e4, bb, 8a, e5, a4, a9].
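For intuition, here is what byte-level tokenization starts from in this example (the byte values come from Python's UTF-8 encoder):

```python
# Byte-level BPE starts from UTF-8 bytes rather than Unicode characters,
# so the base vocabulary has at most 256 symbols.
text = "今天"  # [gloss: today]
print(list(text.encode("utf-8")))               # [228, 187, 138, 229, 164, 169]
print([hex(b) for b in text.encode("utf-8")])   # ['0xe4', '0xbb', '0x8a', '0xe5', '0xa4', '0xa9']
```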
Unigram model (SentencePiece)
Rather than just splitting by frequency, a more “principled” approach is to define an objective function that captures what a good tokenization looks like. We now describe the unigram model (Kudo 2018).
- It is one of the tokenizations supported in the SentencePiece tool (Kudo & Richardson, 2018), along with BPE.
- It was used to train T5 and Gopher.
Given a sequence $x_{1:L}$, a tokenization $T$ is a set of index spans that covers the sequence, and its likelihood under a unigram model $p$ is $p(x_{1:L}) = \prod_{(i, j) \in T} p(x_{i:j})$.
Example:
- Training data (string): $ababc$
- Tokenization $T = \{(1, 2), (3, 4), (5, 5)\}$, i.e., the tokens are $[ab, ab, c]$
- Likelihood: $p(x_{1:L}) = \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{27}$, using token probabilities estimated from the token counts ($p(ab) = 2/3$, $p(c) = 1/3$).
Algorithm:
- Start with a “reasonably big” seed vocabulary $V$.
- Repeat:
  - Given $V$, optimize the token probabilities $p(\cdot)$ and the tokenization $T$ using the EM algorithm.
  - Compute a loss for each token $x \in V$ capturing how much the likelihood would be reduced if $x$ were removed from $V$.
  - Sort by loss and keep the top 80% of tokens in $V$.
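To make the objective concrete, here is a sketch of how the likelihood of one candidate tokenization is scored, mirroring the toy example above (unigram_log_likelihood is an illustrative name, not SentencePiece's API; the EM fitting and pruning steps are omitted):

```python
import math

def unigram_log_likelihood(tokens, probs):
    """Log-likelihood of a tokenization under a unigram model.

    tokens: a candidate segmentation of the string, e.g. ["ab", "ab", "c"]
    probs:  p(token) for each token in the vocabulary
    """
    return sum(math.log(probs[t]) for t in tokens)

# Toy example matching the text: "ababc" tokenized as [ab, ab, c].
probs = {"ab": 2 / 3, "c": 1 / 3}   # estimated from token counts (2 of 3, 1 of 3)
likelihood = math.exp(unigram_log_likelihood(["ab", "ab", "c"], probs))
print(likelihood)                    # ≈ 4/27 ≈ 0.148
```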
Comparing tokenizers
- GPT-2 and GPT-3 used BPE with a vocabulary size of 50K
- Jurassic used SentencePiece with vocabulary size of 256K
Impact:
- Given the same string, Jurassic requires 28% fewer tokens than GPT-3, so it is 1.4x faster
- Both Jurassic and GPT-3 use the same context size (2048), so one can feed in 39% more text into the prompt.
Examples of tokenizations for both GPT-3 and Jurassic (demo):
- GPT-3: [Ab, raham, ␣Lincoln, ␣lived, ␣at, ␣the, ␣White, ␣House, .]
- Jurassic: [Abraham␣Lincoln, ␣lived, ␣at␣the␣White␣House, .]
Models
Thus far, we have defined language models as a probability distribution over sequences of tokens, $p(x_1, \dots, x_L)$.
Contextual embeddings. As a prerequisite, the key development is to associate a sequence of tokens with a corresponding sequence of contextual embeddings:
- As the name suggests, the contextual embedding of a token depends on its context (surrounding words); for example, the embedding of mouse in [the, mouse, ate, the, cheese] depends on the other words in the sentence.
- Notation: we write $\phi : V^L \to \mathbb{R}^{d \times L}$ for the embedding function (analogous to a feature map for sequences).
- For a token sequence $x_{1:L} = [x_1, \dots, x_L]$, $\phi$ produces contextual embeddings $\phi(x_{1:L})$.
Types of language models
We will broaden our notion of language models to three types of models.
Encoder-only (BERT, RoBERTa, etc.). These language models produce contextual embeddings but cannot be used directly to generate text.
These contextual embeddings are generally used for classification tasks (sometimes boldly called natural language understanding tasks).
- Example: sentiment classification
- Example: natural language inference
- Pro: the contextual embedding for $x_i$ can depend bidirectionally on both the left context ($x_{1:i-1}$) and the right context ($x_{i+1:L}$).
- Con: cannot naturally generate completions.
- Con: requires more ad-hoc training objectives (masked language modeling).
Decoder-only (GPT-2, GPT-3, etc.). These are our standard autoregressive language models, which given a prompt $x_{1:i}$ produce contextual embeddings and a distribution over the next token $x_{i+1}$ (and, recursively, over the entire completion $x_{i+1:L}$).
- Example: text autocomplete
- Con: the contextual embedding for $x_i$ can only depend unidirectionally on the left context ($x_{1:i-1}$).
- Pro: can naturally generate completions.
- Pro: simple training objective (maximum likelihood).
Encoder-decoder (BART, T5, etc.).
These models in some ways get the best of both worlds: they can use bidirectional contextual embeddings for the input $x_{1:L}$, and they can generate an output $y_{1:L}$.
- Example: table-to-text generation
- Pro: the contextual embedding for input token $x_i$ can depend bidirectionally on both the left context ($x_{1:i-1}$) and the right context ($x_{i+1:L}$).
- Pro: can naturally generate outputs.
- Con: requires more ad-hoc training objectives.
Preliminaries
We now describe the innards of the embedding function $\phi : V^L \to \mathbb{R}^{d \times L}$.
We now introduce the model architectures for language models, with an emphasis on the ubiquitous Transformer architecture. Our exposition of the Transformer architecture will be based on these slides from CS221 on differentiable programming, and will depart a bit from the standard presentation.
The beauty of deep learning is being able to create building blocks, just like we build whole programs out of functions. So we want to be able to write functions like $\text{TransformerBlock}(x_{1:L})$ that encapsulate the complexity.
These functions will have parameters, which we will include in the body but elide from the function signature for simplicity.
In what follows, we will define a library of building blocks until we get to the full Transformer.
First, we have to convert sequences of tokens into sequences of vectors.
def $\text{EmbedToken}(x_{1:L} : V^L) \to \mathbb{R}^{d \times L}$:
- Turns each token $x_i$ in the sequence into a vector (a lookup into an embedding matrix $E \in \mathbb{R}^{|V| \times d}$).
- Return $[E_{x_1}, \dots, E_{x_L}]$.
These are exactly the (context-independent) word embeddings of yore.
We define an abstract $\text{SequenceModel}$ function that takes a sequence of vectors and returns another sequence of vectors:

def $\text{SequenceModel}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Process each element $x_i$ in the sequence with respect to the other elements.
- [abstract implementation (e.g., $\text{FeedForwardSequenceModel}$, $\text{SequenceRNN}$, $\text{TransformerBlock}$)]
The simplest type of sequence model is based on feedforward networks (Bengio et al., 2003) applied to a fixed length context, just as in an n-gram model:
def $\text{FeedForwardSequenceModel}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Process each element $x_i$ in the sequence by looking at the last $n$ elements.
- For each $i = 1, \dots, L$:
  - Compute $h_i = \text{FeedForward}(x_{i-n+1}, \dots, x_i)$.
- Return $[h_1, \dots, h_L]$.
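Here is a minimal numpy sketch of such a fixed-window feedforward sequence model (the dimensions, the window size n = 2, and the random parameters are arbitrary; the point is only to illustrate the SequenceModel interface):

```python
import numpy as np

d, L, n = 4, 6, 2                      # embedding dim, sequence length, context window
rng = np.random.default_rng(0)
W = rng.normal(size=(d, n * d))        # parameters of the feedforward layer
b = rng.normal(size=d)

def feed_forward_sequence_model(x):    # x: (d, L) sequence of embeddings
    # Pad on the left so every position has n previous vectors to look at.
    padded = np.concatenate([np.zeros((d, n - 1)), x], axis=1)
    h = []
    for i in range(x.shape[1]):
        context = padded[:, i:i + n].reshape(-1)   # last n embeddings, flattened
        h.append(np.tanh(W @ context + b))         # h_i = FeedForward(x_{i-n+1..i})
    return np.stack(h, axis=1)                     # (d, L)

x = rng.normal(size=(d, L))
print(feed_forward_sequence_model(x).shape)        # (4, 6)
```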
Recurrent neural networks
The first “real” sequence model is the recurrent neural network (RNN), a family of models that includes simple RNNs, LSTMs, and GRUs. The basic form of an RNN simply computes a sequence of hidden states recursively.
def $\text{SequenceRNN}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Process the sequence $x_1, \dots, x_L$ left-to-right and recursively compute vectors $h_1, \dots, h_L$.
- For $i = 1, \dots, L$:
  - Compute $h_i = \text{RNN}(h_{i-1}, x_i)$.
- Return $[h_1, \dots, h_L]$.
The actual module that does the hard work is the $\text{RNN}$ update:

def $\text{RNN}(h : \mathbb{R}^d, x : \mathbb{R}^d) \to \mathbb{R}^d$:
- Updates the hidden state $h$ based on a new observation $x$.
- [abstract implementation (e.g., $\text{SimpleRNN}$, $\text{LSTM}$, $\text{GRU}$)]
There are three ways to implement the $\text{RNN}$ update ($\text{SimpleRNN}$, $\text{LSTM}$, $\text{GRU}$); the simplest is:

def $\text{SimpleRNN}(h : \mathbb{R}^d, x : \mathbb{R}^d) \to \mathbb{R}^d$:
- Updates the hidden state $h$ based on a new observation $x$ by a simple linear transformation and non-linearity.
- Return $\sigma(U h + V x + b)$.
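A corresponding numpy sketch of the simple RNN used as a sequence model (again with arbitrary dimensions and random parameters, purely for illustration):

```python
import numpy as np

d, L = 4, 6
rng = np.random.default_rng(0)
U = rng.normal(size=(d, d))            # transforms the previous hidden state
V = rng.normal(size=(d, d))            # transforms the new observation
b = rng.normal(size=d)

def simple_rnn(h, x):
    # Update the hidden state with a linear transformation and non-linearity.
    return np.tanh(U @ h + V @ x + b)

def sequence_rnn(x):                   # x: (d, L)
    h = np.zeros(d)
    hs = []
    for i in range(x.shape[1]):        # left-to-right recursion
        h = simple_rnn(h, x[:, i])
        hs.append(h)
    return np.stack(hs, axis=1)        # (d, L) contextual embeddings

x = rng.normal(size=(d, L))
print(sequence_rnn(x).shape)           # (4, 6)
```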
As defined, RNNs only depend on the past, but we can make them depend on the future too by running another RNN backwards. These models were used by ELMo and ULMFiT.
def $\text{BidirectionalSequenceRNN}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{2d \times L}$:
- Process the sequence both left-to-right and right-to-left.
- Compute left-to-right: $[h^{\rightarrow}_1, \dots, h^{\rightarrow}_L] \leftarrow \text{SequenceRNN}(x_1, \dots, x_L)$.
- Compute right-to-left: $[h^{\leftarrow}_L, \dots, h^{\leftarrow}_1] \leftarrow \text{SequenceRNN}(x_L, \dots, x_1)$.
- Return $[h^{\rightarrow}_1 h^{\leftarrow}_1, \dots, h^{\rightarrow}_L h^{\leftarrow}_L]$ (concatenating the two directions).
Notes:
- The simple RNN is difficult to train due to vanishing gradients.
- The Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), both more elaborate RNNs, have been developed to address this.
- Still, even though the embedding $h_i$ can depend arbitrarily far back (e.g., on $x_1$), it is unlikely to depend on it in a “crisp” way (see Khandelwal et al., 2018 for more discussion).
- LSTMs in some sense were really what brought deep learning into full swing within NLP.
We will not discuss these models in the interest of time.
Transformers
Now, we will discuss Transformers (Vaswani et al. 2017), the sequence model that is really responsible for the takeoff of large language models; they are the building blocks of decoder-only (GPT-2, GPT-3), encoder-only (BERT, RoBERTa), and encoder-decoder (BART, T5) models.
There are great resources for learning about the Transformer:
- Illustrated Transformer and Illustrated GPT-2: very nice visual description of the Transformer.
- Annotated Transformer: PyTorch implementation of the Transformer.
You are highly encouraged to read these references. In this lecture, I will strive to take a middle path which emphasizes pseudocode functions and interfaces.
The crux of the Transformer is the attention mechanism, which was developed earlier for machine translation (Bahdanau et al., 2015).
One can think of attention as a “soft” lookup table, where we have a query $y$ that we want to match against each element $x_i$ of a key–value table.

We can think of each $x_i$ as representing a key–value pair via linear transformations, $(W_{\text{key}} x_i, W_{\text{value}} x_i)$, and forming the query via another linear transformation, $W_{\text{query}} y$.

The key and the query can be compared to give a score:
$\text{score}_i = \frac{(W_{\text{key}} x_i) \cdot (W_{\text{query}} y)}{\sqrt{d}}.$

These scores can be exponentiated and normalized to form a probability distribution over the token positions $\{1, \dots, L\}$:
$[\alpha_1, \dots, \alpha_L] = \text{softmax}([\text{score}_1, \dots, \text{score}_L]).$

Then the final output is a weighted combination over the values:
$\sum_{i=1}^{L} \alpha_i \, (W_{\text{value}} x_i).$

We can write this all succinctly in matrix form:

def $\text{Attention}(x_{1:L} : \mathbb{R}^{d \times L}, y : \mathbb{R}^d) \to \mathbb{R}^d$:
- Process $y$ by comparing it to each $x_i$.
- Return $(W_{\text{value}} x_{1:L}) \, \text{softmax}\!\left(\frac{(W_{\text{key}} x_{1:L})^\top (W_{\text{query}} y)}{\sqrt{d}}\right)$.
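Here is a numpy sketch of single-headed attention following the matrix form above (the parameter names mirror the notation; the dimensions and random initialization are arbitrary and just for illustration):

```python
import numpy as np

d, L = 8, 5
rng = np.random.default_rng(0)
W_key, W_query, W_value = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention(x, y):
    """x: (d, L) sequence to attend over; y: (d,) query vector."""
    scores = (W_key @ x).T @ (W_query @ y) / np.sqrt(d)   # (L,) one score per position
    alpha = softmax(scores)                               # attention weights over positions
    return (W_value @ x) @ alpha                          # weighted combination of values, (d,)

x = rng.normal(size=(d, L))
y = rng.normal(size=d)
print(attention(x, y).shape)  # (8,)
```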
We can think of there as being multiple aspects (e.g., syntax, semantics) that we would want to match on. To accommodate this, we can simultaneously have multiple attention heads and simply combine their outputs.
def $\text{MultiHeadedAttention}(x_{1:L} : \mathbb{R}^{d \times L}, y : \mathbb{R}^d) \to \mathbb{R}^d$:
- Process $y$ by comparing it to each $x_i$ with respect to $n_{\text{heads}}$ aspects.
- Return $W_{\text{output}} \left[\text{Attention}_1(x_{1:L}, y), \dots, \text{Attention}_{n_{\text{heads}}}(x_{1:L}, y)\right]$, where each head has its own projection matrices.
Self-attention layer. Now we will substitute each $x_i$ in place of $y$ as the query, so that every token attends to every other token:

def $\text{SelfAttention}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Compare each element $x_i$ to each other element.
- Return $[\text{Attention}(x_{1:L}, x_1), \dots, \text{Attention}(x_{1:L}, x_L)]$.
Feedforward layer. Self-attention allows all the tokens to “talk” to each other, whereas the feedforward connections process each token independently:

def $\text{FeedForward}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Process each token independently.
- For $i = 1, \dots, L$:
  - Compute $y_i = W_2 \max(W_1 x_i + b_1, 0) + b_2$.
- Return $[y_1, \dots, y_L]$.
Improving trainability.
We’re almost done.
We could in principle just stack the $\text{SelfAttention}$ and $\text{FeedForward}$ layers many times, but such deep networks are hard to optimize, so we use two tricks to improve trainability.
Residual connections.
One trick from computer vision is residual connections (ResNet).
Instead of applying some function $f$ directly, $x_{1:L} \mapsto f(x_{1:L})$,
we add a residual (skip) connection, $x_{1:L} \mapsto x_{1:L} + f(x_{1:L})$, so that if the gradient of $f$ vanishes, the gradient can still flow through $x_{1:L}$.
Layer normalization. Another trick is layer normalization, which takes a vector and makes sure its elements are not too big:

def $\text{LayerNorm}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Make each $x_i$ not too big or small (by standardizing its entries).
We first define an adapter function that takes a sequence model $f$ and makes it “safe” to apply:

def $\text{AddNorm}(f : (\mathbb{R}^{d \times L} \to \mathbb{R}^{d \times L}), x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Safely apply $f$ to $x_{1:L}$.
- Return $\text{LayerNorm}(x_{1:L} + f(x_{1:L}))$.
Finally, we can define the Transformer block succinctly as follows:
def $\text{TransformerBlock}(x_{1:L} : \mathbb{R}^{d \times L}) \to \mathbb{R}^{d \times L}$:
- Process each element $x_i$ in context.
- Return $\text{AddNorm}(\text{FeedForward}, \text{AddNorm}(\text{SelfAttention}, x_{1:L}))$.
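Putting the pieces together, here is a hedged numpy sketch of one Transformer block with a single attention head (real implementations use multiple heads, learned parameters, masking for decoder-only models, and careful initialization; this only mirrors the structure of the pseudocode above):

```python
import numpy as np

d, d_ff, L = 8, 32, 5
rng = np.random.default_rng(0)
W_key, W_query, W_value = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1, b1 = rng.normal(size=(d_ff, d)) / np.sqrt(d), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff), np.zeros(d)

def softmax(s):
    e = np.exp(s - s.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_attention(x):                       # x: (d, L)
    scores = (W_key @ x).T @ (W_query @ x) / np.sqrt(d)   # (L, L): key i versus query j
    alpha = softmax(scores)                               # normalize over keys (axis 0)
    return (W_value @ x) @ alpha                          # (d, L)

def feed_forward(x):                         # applied to each position independently
    return W2 @ np.maximum(W1 @ x + b1[:, None], 0) + b2[:, None]

def layer_norm(x, eps=1e-5):                 # standardize each column (token) of x
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / (sigma + eps)

def add_norm(f, x):                          # residual connection + layer normalization
    return layer_norm(x + f(x))

def transformer_block(x):
    return add_norm(feed_forward, add_norm(self_attention, x))

x = rng.normal(size=(d, L))
print(transformer_block(x).shape)            # (8, 5)
```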
Positional embeddings.
You might have noticed that as defined, the embedding of a token doesn’t depend on where it occurs in the sequence, so the two occurrences of the in [the, mouse, ate, the, cheese] would receive identical embeddings even though they sit in different positions.
To fix this, we add positional information into the embedding:
def $\text{EmbedTokenWithPosition}(x_{1:L} : V^L) \to \mathbb{R}^{d \times L}$:
- Add in positional information.
- Define positional embeddings:
  - Even dimensions: $P_{i, 2j} = \sin(i / 10000^{2j/d})$
  - Odd dimensions: $P_{i, 2j+1} = \cos(i / 10000^{2j/d})$
- Return $\text{EmbedToken}(x_{1:L}) + [P_1, \dots, P_L]$.
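A numpy sketch of the sinusoidal positional embeddings (positions are 1-indexed to match the pseudocode, the embedding dimension is assumed to be even, and the function here takes token embeddings directly rather than token ids, to keep the sketch short):

```python
import numpy as np

def positional_embeddings(L, d):
    """P[i, j]: positional embedding for position i+1, dimension j (d must be even)."""
    P = np.zeros((L, d))
    positions = np.arange(1, L + 1)[:, None]          # i = 1, ..., L
    rates = 10000 ** (np.arange(0, d, 2) / d)         # one rate per pair of dimensions
    P[:, 0::2] = np.sin(positions / rates)            # even dimensions
    P[:, 1::2] = np.cos(positions / rates)            # odd dimensions
    return P                                          # (L, d)

def embed_token_with_position(embeddings):            # embeddings: (d, L)
    d, L = embeddings.shape
    return embeddings + positional_embeddings(L, d).T  # add positional information

x = np.random.default_rng(0).normal(size=(8, 5))       # 5 tokens, dimension 8
print(embed_token_with_position(x).shape)              # (8, 5)
```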
GPT-3. With all the pieces in place, we can now define the GPT-3 architecture (roughly) in one line, just by stacking the Transformer block 96 times:

$\text{GPT-3}(x_{1:L}) = \text{TransformerBlock}^{96}(\text{EmbedTokenWithPosition}(x_{1:L}))$
Shape of the architecture (how the 175 billion parameters are allocated):
- Dimension of the hidden state: $d_{\text{model}} = 12288$
- Dimension of the intermediate feed-forward layer: $d_{\text{ff}} = 4 d_{\text{model}} = 49152$
- Number of heads: $n_{\text{heads}} = 96$
- Context length: $L = 2048$
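As a rough sanity check on the 175 billion figure, here is a back-of-the-envelope parameter count (ignoring biases, layer norms, and positional embeddings; the vocabulary size 50257 is the GPT-2/GPT-3 BPE vocabulary):

```python
d_model, n_layers, vocab = 12288, 96, 50257

# Each Transformer layer: 4 d^2 for the attention projections (query, key, value, output)
# plus 8 d^2 for the feedforward layer (d -> 4d -> d), ignoring biases.
per_layer = 12 * d_model ** 2
total = n_layers * per_layer + vocab * d_model    # add the token embedding matrix
print(f"{total / 1e9:.1f}B parameters")           # ≈ 174.6B, close to the quoted 175B
```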
These decisions are not necessarily optimal. Levine et al. 2020 provide some theoretical justification, showing that GPT-3 is too deep for its width, which motivated the training of the shallower but wider Jurassic architecture.
There are important but detailed differences between different versions of Transformers:
- Layer normalization: “post-norm” (original Transformer paper) versus “pre-norm” (GPT-2), which impacts training stability (Davis et al. 2021).
- Dropout is applied throughout to prevent overfitting.
- GPT-3 alternates dense attention layers with locally banded sparse attention layers (as in the Sparse Transformer) to reduce computation.
- Depending on the type of Transformer (encoder-only, decoder-only, encoder-decoder), different masking operations are used.
- And of course there are many more details involved in the training of Transformer models which we will discuss next time.
Further reading
Tokenization:
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan. 2021. Comprehensive survey of tokenization.
- Neural Machine Translation of Rare Words with Subword Units. Rico Sennrich, B. Haddow, Alexandra Birch. ACL 2015. Introduces byte pair encoding into NLP. Used by GPT-2, GPT-3.
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, M. Krikun, Yuan Cao, Qin Gao, Klaus Macherey, J. Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Y. Kato, Taku Kudo, H. Kazawa, K. Stevens, George Kurian, Nishant Patil, W. Wang, C. Young, Jason R. Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean. 2016. Introduces WordPiece. Used by BERT.
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Taku Kudo, John Richardson. EMNLP 2018. Introduces SentencePiece.
Modeling:
- Language Models are Unsupervised Multitask Learners. Introduces GPT-2.
- Attention is All you Need. Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. NIPS 2017.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Ofir Press, Noah A. Smith, M. Lewis. 2021. Introduces ALiBi embeddings.
- Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, J. Carbonell, Quoc V. Le, R. Salakhutdinov. ACL 2019. Introduces recurrence on Transformers and a relative position encoding scheme.
- Generating Long Sequences with Sparse Transformers. R. Child, Scott Gray, Alec Radford, Ilya Sutskever. 2019. Introduces Sparse Transformers.
- Linformer: Self-Attention with Linear Complexity. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma. 2020. Introduces Linformers.
- Rethinking Attention with Performers. K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller. ICLR 2020. Introduces Performers.
- Efficient Transformers: A Survey. Yi Tay, M. Dehghani, Dara Bahri, Donald Metzler. 2020.
Decoder-only architectures:
- Language Models are Unsupervised Multitask Learners. Alec Radford, Jeff Wu, R. Child, D. Luan, Dario Amodei, Ilya Sutskever. 2019. Introduces GPT-2 from OpenAI.
- Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. NeurIPS 2020. Introduces GPT-3 from OpenAI.
- Scaling Language Models: Methods, Analysis&Insights from Training Gopher. Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, J. Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, G. V. D. Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John F. J. Mellor, I. Higgins, Antonia Creswell, Nathan McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, D. Budden, Esme Sutherland, K. Simonyan, Michela Paganini, L. Sifre, Lena Martens, Xiang Lorraine Li, A. Kuncoro, Aida Nematzadeh, E. Gribovskaya, Domenic Donato, Angeliki Lazaridou, A. Mensch, J. Lespiau, Maria Tsimpoukelli, N. Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Tobias Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, I. Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem W. Ayoub, Jeff Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, Geoffrey Irving. 2021. Introduces Gopher from DeepMind.
- Jurassic-1: Technical details and evaluation. Opher Lieber, Or Sharir, Barak Lenz, Yoav Shoham. 2021. Introduces Jurassic from AI21 Labs.
Encoder-only architectures:
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL 2019. Introduces BERT from Google.
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov. 2019. Introduces RoBERTa from Facebook.
Encoder-decoder architectures:
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. M. Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer. ACL 2019. Introduces BART from Facebook.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, Peter J. Liu. J. Mach. Learn. Res. 2019. Introduces T5 from Google.