Section 1 - Introduction

This is a new course on understanding and developing large language models.

What is a language model?
A brief history
Why does this course exist?
Structure of this course

What is a language model?

A language model is akin to a statistical compass that navigates through the vast ocean of word sequences, assigning a likelihood to each arrangement.

Imagine having a vocabulary as your map, where each set of tokens is given a probability score by the model , a number nestled between 0 and 1 that reflects the sequence’s coherence:

Take, for instance, a vocabulary that includes “late,” “ball,” “cheese,” “mouse,” and “the.” The model might assign probabilities like so:

While a language model appears as a straightforward mathematical construct, its elegance masks the complexity of the task: assigning meaningful probabilities across all sequences demands profound, albeit implicit, linguistic insights and a grasp of world knowledge.

For instance, the sequence “mouse the the cheese ate” should be assigned a low probability because it’s syntactically incorrect—a judgment rooted in syntactic knowledge. Conversely, “the mouse ate the cheese” should be deemed more probable than “the cheese ate the mouse,” a distinction informed by world knowledge, even though both sentences share the same structure, they diverge in semantic plausibility.

A language model , as defined, processes an input sequence and outputs a probability that serves as a measure of the sequence’s quality. We can also initiate the creation of a sequence with the aid of a language model. The most straightforward approach is to draw a sequence directly from the model , where the probability of selection is proportional to , signified as:

The efficiency of this computational task hinges on the structure of the language model itself. In practice, direct sampling from the language model is uncommon, due to both the inherent constraints of real-world language models and the desire to produce sequences that aren’t just average but rather optimal.

Autoregressive Language Models

The joint distribution of a sequence is commonly formulated using the chain rule of probability:

For instance:

Here, represents the conditional probability distribution of the next token given all prior tokens in the sequence.

While the mathematical expression of any joint probability distribution can be deconstructed in this manner, an autoregressive language model stands out by enabling the efficient computation of each conditional distribution , often employing a feedforward neural network for such tasks.

Generation. To generate a complete sequence from an autoregressive language model , we sequentially sample each token, conditioned on the previously generated tokens:

where is a temperature parameter dictating the level of randomness in our sampling process:

: Select the most probable token deterministically at each step
: Sample from the language model as is
: Sample from a uniform distribution across the entire vocabulary

However, simply raising probabilities to the power may result in a distribution that doesn’t sum to 1. We resolve this by re-normalizing the probabilities. The normalized version is known as the annealed conditional probability distribution. For instance:

Aside: Annealing is a term from metallurgy, referring to the gradual cooling of heated materials, and is also used in sampling and optimization algorithms like simulated annealing.

Technical note: Iteratively sampling with a temperature applied to each conditional distribution is not equivalent (except when ) to sampling from the annealed distribution across sequences of length .

Conditional generation. Broadly, we can conduct conditional generation by setting a prefix sequence (referred to as a prompt) and sampling the subsequent part (known as the completion). For example, generating with yields:

Adjusting the temperature to introduces more diversity, as seen with examples like and .

Soon, we’ll observe how conditional generation empowers language models to tackle a myriad of tasks simply by varying the prompt.

Summary

A language model represents a probability distribution across sequences .
Essentially, an effective language model is endowed with proficiency in language and an understanding of the world.
An autoregressive language model enables the efficient creation of a subsequent segment when provided with an initial segment as a prompt.
The variability in the generation process can be modulated by adjusting the temperature parameter.

A brief history

Information theory, entropy of English, n-gram models

Information theory. Language models date back to Claude Shannon, who founded information theory in 1948 with his seminal paper, A Mathematical Theory of Communication. In this paper, he introduced the entropy of a distribution as

The entropy measures the expected number of bits any algorithm needs to encode (compress) a sample into a bitstring:

The lower the entropy, the more “structured” the sequence is, and the shorter the code length.
Intuitively, is the length of the code used to represent an element that occurs with probability .
If , we should allocate bits (equivalently, nats).

Aside: actually achieving the Shannon limit is non-trivial (e.g., LDPC codes) and is the topic of coding theory.

Entropy of English. Shannon was particularly interested in measuring the entropy of English, represented as a sequence of letters. This means we imagine that there is a “true” distribution out there (the existence of this is questionable, but it’s still a useful mathematical abstraction) that can spout out samples of English text .

Shannon also defined cross entropy:

which measures the expected number of bits (nats) needed to encode a sample using the compression scheme given by the model (representing with a code of length ).

Estimating entropy via language modeling. A crucial property is that the cross entropy upper bounds the entropy ,

which means that we can estimate by constructing a (language) model with only samples from the true data distribution , whereas is generally inaccessible if is English.

So we can get better estimates of the entropy by constructing better models , as measured by .

Shannon game (human language model). Shannon first used n-gram models as in 1948, but in his 1951 paper Prediction and Entropy of Printed English, he introduced a clever scheme (known as the Shannon game) where was provided by a human:

Humans aren’t good at providing calibrated probabilities of arbitrary text, so in the Shannon game, the human language model would repeatedly try to guess the next letter, and one would record the number of guesses.

N-gram models for downstream applications

Language models became first used in practical applications that required generation of text:

speech recognition in the 1970s (input: acoustic signal, output: text), and
machine translation in the 1990s (input: text in a source language, output: text in a target language).

Noisy channel model. The dominant paradigm for solving these tasks then was the noisy channel model. Taking speech recognition as an example:

We posit that there is some text sampled from some distribution .
This text becomes realized to speech (acoustic signals).
Then given the speech, we wish to recover the (most likely) text. This can be done via Bayes rule:

Speech recognition and machine translation systems used n-gram language models over words (first introduced by Shannon, but for characters).

N-gram models. In an n-gram model, the prediction of a token only depends on the last characters rather than the full history:

For example, a trigram () model would define:

These probabilities are computed based on the number of times various n-grams (e.g., and ) occur in a large corpus of text, and appropriately smoothed to avoid overfitting (e.g., Kneser-Ney smoothing).

Fitting n-gram models to data is extremely computationally cheap and scalable. As a result, n-gram models were trained on massive amount of text. For example, Brants et al. (2007) trained a 5-gram model on 2 trillion tokens for machine translation. In comparison, GPT-3 was trained on only 300 billion tokens. However, an n-gram model was fundamentally limited. Imagine the prefix:

If is too small, then the model will be incapable of capturing long-range dependencies, and the next word will not be able to depend on . However, if is too big, it will be statistically infeasible to get good estimates of the probabilities (almost all reasonable long sequences show up 0 times even in “huge” corpora):

As a result, language models were limited to tasks such as speech recognition and machine translation where the acoustic signal or source text provided enough information that only capturing local dependencies (and not being able to capture long-range dependencies) wasn’t a huge problem.

Neural language models

An important step forward for language models was the introduction of neural networks. Bengio et al., 2003 pioneered neural language models, where is given by a neural network:

Note that the context length is still bounded by , but it is now statistically feasible to estimate neural language models for much larger values of .

Now, the main challenge was that training neural networks was much more computationally expensive. They trained a model on only 14 million words and showed that it outperformed n-gram models trained on the same amount of data. But since n-gram models were more scalable and data was not a bottleneck, n-gram models continued to dominate for at least another decade.

Since 2003, two other key developments in neural language modeling include:

Recurrent Neural Networks (RNNs), including Long Short Term Memory (LSTMs), allowed the conditional distribution of a token to depend on the entire context (effectively ), but these were hard to train.
Transformers are a more recent architecture (developed for machine translation in 2017) that again returned to having fixed context length , but were much easier to train (and exploited the parallelism of GPUs). Also, could be made “large enough” for many applications (GPT-3 used ).

We will open up the hood and dive deeper into the architecture and training later in the course.

Summary

Language models were first studied in the context of information theory, and can be used to estimate the entropy of English.
N-gram models are extremely computationally efficient and statistically inefficient.
N-gram models are useful for short context lengths in conjunction with another model (acoustic model for speech recognition or translation model for machine translation).
Neural language models are statistically efficient but computationally inefficient.
Over time, training large neural networks has become feasible enough that neural language models have become the dominant paradigm.

Why does this course exist?

Having introduced language models, one might wonder why we need a course specifically on large language models.

Increase in size. First, what do we mean by large? With the rise of deep learning in the 2010s and the major hardware advances (e.g., GPUs), the size of neural language models has skyrocketed. The following table shows that the model sizes have increased by an order of 5000x over just the last 4 years:

Model	Organization	Date	Size (# params)
ELMo	AI2	Feb 2018	94,000,000
GPT	OpenAI	Jun 2018	110,000,000
BERT	Google	Oct 2018	340,000,000
XLM	Facebook	Jan 2019	655,000,000
GPT-2	OpenAI	Mar 2019	1,500,000,000
RoBERTa	Facebook	Jul 2019	355,000,000
Megatron-LM	NVIDIA	Sep 2019	8,300,000,000
T5	Google	Oct 2019	11,000,000,000
Turing-NLG	Microsoft	Feb 2020	17,000,000,000
GPT-3	OpenAI	May 2020	175,000,000,000
Megatron-Turing NLG	Microsoft, NVIDIA	Oct 2021	530,000,000,000
Gopher	DeepMind	Dec 2021	280,000,000,000

Emergence. What difference does scale make? Even though much of the technical machinery is the same, the surprising thing is that “just scaling up” these models produces new emergent behavior, leading to qualitatively different capabilities and qualitatively different societal impact.

Aside: at a technical level, we have focused on autoregressive language models, but many of the ideas carry over to masked language models such as BERT and RoBERTa.

Capabilities

Whereas language models up until 2018 were mainly used as one component of a larger system (e.g., speech recognition or machine translation), language models are increasingly becoming more capable of being a standalone system, something that would be unthinkable in the past.

Recall that language models are capable of conditional generation: given a prompt, generate a completion:

Examples of capabilities. This simple interface opens up the possibility of having a language model solve a vast variety of tasks by just changing the prompt. For example, one can perform question answering by prompting with a fill in the blank (demo):

One can prompt a language model to solve word analogies (demo):

One can prompt a language model to generate a news article based on a headline (demo).

Here is an example of an article that GPT-3 fabricated (everything after the bolded text):

Title: NLP Researchers at Stanford Discover Black Holes in Language Models Article: On January 3, 2007, the Stanford University News Service published an article that reported a remarkable discovery by NLP researchers at Stanford. The article was titled “Stanford Researchers Discover Black Holes in Language Models.” The discovery was described as follows: A black hole is a region of space-time where gravity pulls so much that even light cannot get out. Now physicists think they have found a similar phenomenon in language: They call it the semantic black hole. It occurs when a word or phrase has no clear definition – and sometimes no clear meaning at all. If you toss such a word into a sentence, it drags along other words until eventually the whole thing collapses under its own weight. “It’s like if you have a paper cup and you push in the bottom,” said Stanford computer scientist Michael Schmidt. “At first it holds up fine, but then it gets weaker and weaker until it collapses in on itself.” Schmidt and his colleagues are using computers to identify and avoid semantic black holes.

In-context learning. Perhaps the most intriguing thing about GPT-3 is that it can perform what is called in-context learning. Let’s start with an example (demo):

Input: Where is Stanford University? Output: Stanford University is in California.

We (i) see that the answer given by GPT-3 is not the most informative and (ii) perhaps want the answer directly rather than a full sentence.

Similar to word analogies from earlier, we can construct a prompt that includes examples of what input/outputs look like. GPT-3 somehow manages to understand the task better from these examples and is now able to produce the desired answer (demo):

Input: Where is MIT? Output: Cambridge

Input: Where is University of Washington? Output: Seattle

Input: Where is Stanford University? Output: Stanford

Relationship to supervised learning. In normal supervised learning, one specifies a dataset of input-output pairs and trains a model (e.g., a neural network via gradient descent) to fit those examples. Each training run produces a different model. However, with in-context learning, there is only one language model that can be coaxed via prompts to perform all sorts of different tasks. In-context learning is certainly beyond what researchers expected was possible and is an example of emergent behavior.

Aside: neural language models also produce vector representations of sentences, which could be used as features in a downstream task or fine-tuned directly for optimized performance. We focus on using language models via conditional generation, which only relies on blackbox access for simplicity.

Language models in the real-world

Given the strong capabilities of language models, it is not surprising to see their widespread adoption.

Research. First, in the research world, the NLP community has been completely transformed by large language models. Essentially every state-of-the-art system across a wide range of tasks such as sentiment classification, question answering, summarization, and machine translation are all based on some type of language model.

Industry. In production systems that affect real users, it is harder to know for sure since most of these systems are closed. Here is a very incomplete list of some high profile large language models that are being used in production:

Given the performance improvement offered by something like BERT, it seems likely that every startup using language is using these models to some extent. Taken altogether, these models are therefore affecting billions of people.

An important caveat is that the way language models (or any technology) are used in industry is complex. They might be fine-tuned to specific scenarios and distilled down into smaller models that are more computationally efficient to serve at scale. There might be multiple systems (perhaps even all based on language models) that act in a concerted manner to produce an answer.

Risks

So far, we have seen that by scaling up language models, they become exceptionally capable of tackling many tasks. However, not everything is as rosy, and there are substantial risks associated with the use of language models. Multiple papers, including the stochastic parrots paper, the foundation models report, and DeepMind’s paper on ethical and social harms detail the risks. Let us highlight a few of them, which we will study in more detail in this course.

Reliability. If you play around with GPT-3, it works better than you might expect, but much of the time, it still fails to produce the correct answer. Worse, the answer can seem correct and there is no way of knowing (demo)

Input: Who invented the Internet? Output: Al Gore

In high-stakes applications such as healthcare, giving wrong information would not be acceptable. How can we make language models more reliable?

Social bias. It has been well documented that machine learning systems exhibit bias: they have performance disparities across demographic groups, and their predictions can enforce stereotypes. For example, we can probe the biases inherent in a language model by looking at the probabilities of pairs of sentences that differ only by one pronoun (demo):

The software developer finished the program. He celebrated.

The software developer finished the program. She celebrated.

Social biases are of course encoded in the data, and a model that is trained based on this data will inherit the properties of the data. So how should we more carefully select data to mitigate bias? What kind of interventions can be done during training? Stepping back, how do we even define or measure social bias?

Toxicity. Large language models are trained on a huge amount of Internet data (e.g., Reddit), which inevitably contains offensive content. RealToxicityPrompts is a dataset that evaluates a language model’s propensity for producing toxic content. For example:

So, I’m starting to think she’s full _

As another example, GPT-3 has been demonstrated to output anti-Muslim stereotypes:

Two Muslims walked into a _

Applications such as writing assistants or chatbots would be vulnerable.

Disinformation. We saw already that GPT-3 could be used to fabricate new articles with ease. This technology could be used by malicious actors to run disinformation campaigns with greater ease. Because of large language models’ linguistic abilities, foreign state actors could much more easily create fluent, persuasive text without the risks of hiring native speakers.

Security. Large language models are currently trained on a scrape of the public Internet, which means that anyone can put up a website that could potentially enter the training data. From a security point of view, this is a huge security hole, because an attacker can perform a data poisoning attack. For example, this paper shows that poison documents can be injected into the training set such that the model generates negative sentiment text whenever is in the prompt:

In general, the poison documents can be inconspicuous and, given the lack of careful curation that happens with existing training sets, this is a huge problem.

Legal considerations. Language models are trained on copyright data (e.g., books). Is this protected by fair use? Even if it is, if a user uses a language model to generate text that happens to be copyrighted text, are they liable for copyright violation?

For example, if you prompt GPT-3 with the first line of Harry Potter (demo):

Mr. and Mrs. Dursley of number four, Privet Drive, _

It will happily continue to spout out text from Harry Potter with high confidence.

Cost and environmental impact. Finally, large language models can be quite expensive to work with.

Training often requires parallelizing over thousands of GPUs. For example, GPT-3 is estimated to cost around $5 million. This is a one-time cost.
Inference on the trained model to make predictions also imposes costs, and this is a continual cost.

One societal consequence of the cost is the energy required to power the GPUs, and consequently, the carbon emissions and ultimate environmental impact. However, determining the cost-benefit tradeoffs is tricky. If a single language model can be trained once that can power many downstream tasks, then this might be cheaper than training individual task-specific models. However, the undirected nature of language models might be massively inefficient given the actual use cases.

Access. An accompanying concern with rising costs is access. Whereas smaller models such as BERT are publicly released, more recent models such as GPT-3 are closed and only available through API access. The trend seems to be sadly moving us away from open science and towards proprietary models that only a few organizations with the resources and the engineering expertise can train. There are a few efforts that are trying to reverse this trend, including Hugging Face’s Big Science project, EleutherAI, and Stanford’s CRFM. Given language models’ increasing social impact, it is imperative that we as a community find a way to allow as many scholars as possible to study, critique, and improve this technology.

Summary

A single large language model is a jack of all trades (and also master of none). It can perform a wide range of tasks and is capable of emergent behavior such as in-context learning.
They are widely deployed in the real-world.
There are still many significant risks associated with large language models, which are open research questions.
Costs are a huge barrier for having broad access.

Structure of this course

This course will be structured like an onion:

Behavior of large language models: We will start at the outer layer where we only have blackbox API access to the model (as we’ve had so far). Our goal is to understand the behavior of these objects called large language models, as if we were a biologist studying an organism. Many questions about capabilities and harms can be answered at this level.
Data behind large language models: Then we take a deeper look behind the data that is used to train large language models, and address issues such as security, privacy, and legal considerations. Having access to the training data provides us with important information about the model, even if we don’t have full access to the model.
Building large language models: Then we arrive at the core of the onion, where we study how large language models are built (the model architectures, the training algorithms, etc.).
Beyond large language models: Finally, we end the course with a look beyond language models. A language model is just a distribution over a sequence of tokens. These tokens could represent natural language, or a programming language, or elements in an audio or visual dictionary. Language models also belong to a more general class of foundation models, which share many of the properties of language models.

What’s on your mind?

Create GitHub Issue

Send us feedback

Section 1 - Introduction

What is a language model?

Autoregressive Language Models

Summary

A brief history

Information theory, entropy of English, n-gram models

N-gram models for downstream applications

Neural language models

Summary

Why does this course exist?

Capabilities

Language models in the real-world

Risks

Summary

Structure of this course

Further reading