Deep Learning

Lecture 13: Self-supervised learning

2021-11-21T00:00:00+00:00

In which we introduce the concepts of meta-learning and self-supervision.

Supervised (deep) learning has mainly gone after datasets such as MNIST or CIFAR-10/100, which have a small number of classes, and many samples per class.

But humans can generalize really well even with a very small number of examples per class! Think of the last time you saw the picture of an unknown animal. You clearly don’t need hundreds of examples in order to learn a concept.

Even worse: in several applications, you can’t get hundreds of examples anyway. Think of building an AI assistant to help doctors with diagnosis: every test example may be new, the most critical cases tend to be the rarest, and large datasets are hard to find.

Therefore, it is of crucial importance moving forward to devise DL techniques that succeed with relatively few data points. An interesting early test bed is the Omniglot dataset, popularized in ML by Brenden Lake at NYU CDS, which can be thought of as “Transpose-MNIST” – lots of classes and very few samples per class. How do we effectively learn in this type of scenario?

Such problems fall into the realm of “few-shot” learning, where “shots” refers to the number of examples. For example, an $n$-class $k$-shot classification task requires learning to classify between $n$ classes using only $k$ (potentially $\ll n$) examples per class.
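To make the $n$-class $k$-shot setup concrete, here is a minimal sketch of how one such task could be sampled from a larger labeled pool. The function name `sample_episode`, the toy dataset, and the held-out "query" examples used for evaluation are assumptions for illustration; they are not part of the notes above.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """dataset: dict mapping a class label to a list of examples."""
    # Pick n_way distinct classes for this task.
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw k_shot training examples plus n_query held-out examples.
        examples = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

rng = random.Random(0)
# Toy pool: 8 classes with 20 examples each (strings stand in for images).
toy = {c: [f"{c}_{i}" for i in range(20)] for c in "abcdefgh"}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3, rng=rng)
```

Here `support` is a 5-class 1-shot training set, and `query` holds three unseen examples per class on which the adapted model could be scored.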

If an ML agent were given a $k$-shot dataset, how should it solve such a challenging task? The rapidly growing field of meta-learning advocates the following principles:

each learning agent trying to solve a new task is guided by a higher-level meta-learner
the meta-learner possesses meta-knowledge (in the form of features, or pre-trained nets, or other quantities) which is imparted to the learning agents when they are being trained.
(here is the crux) the meta-learner itself can be trainable, and is able to learn from experience as it teaches different agents.

Transfer learning

Let us be concrete. A canonical example of the above approach is transfer learning. We have actually already discussed transfer learning before (and implemented it for the case of R-CNN type object detection).

The high level idea is that given an ML task with a limited-sample dataset, one starts with a pre-trained base network that has already been trained on perhaps a bigger dataset (like ImageNet for images, or Wikipedia for NLP), and uses the given dataset to fine-tune to any new given task.

The problem, of course, is that one requires a good enough base model to start with. In the examples seen so far, the base model has been pre-trained using a massive dataset. The essence of pre-training is to get “good enough” features which generalize well for the given task, and it is not entirely clear if such “good enough” features could be learned in the few-shot setting. Below we will address more principled ways of performing transfer learning in the few-shot setting.

Model-agnostic meta learning (MAML)

Back to transfer learning. A different way of thinking about the few-shot learning problem is to visualize the tasks as different points in the parameter space. In this scenario, transfer learning/fine-tuning can be viewed as a souped-up initialization procedure where we initialize the weights at some known, excellent point in the parameter space, and use the available few-shot data to move the weights to some point better suited to the task.

Of course, this assumes that the new task we are trying to learn is somehow close enough to the base model that it can be trained via a few steps of gradient descent. As different tasks are solved, can the meta-learner update the base model itself? If trained over sufficiently many tasks, then perhaps the base model is no longer required to be trained using a specific, large dataset – it can be a general model whose only goal is to be “fine-tunable” to different tasks using a small number of gradient descent steps. In that sense, this approach would be model-agnostic.

Let us formalize this idea (which is called model-agnostic meta-learning, or MAML). There is a base model $\phi$. There are $J$ different tasks. We use the base model $f_\phi$ – $\phi$ are the weights – as initialization. For each new task dataset $T_j$, we form a loss function (based on few-shot samples) $L(f_\phi, T_j)$ – this stands for the model $f$ with weights $\phi$ evaluated on training dataset $T_j$ – and fine-tune these weights using gradient descent. In the simplest case, if we used one step of gradient descent, this could be written as:

\[\phi_j \leftarrow \phi - \alpha \nabla_\phi L(f_\phi, T_j) .\]

If we used two steps of gradient descent, we would have to iterate the above equation twice. And so on. We use the final weights $\phi_j$ to solve task $T_j$.

The hope is that if the base model is good enough then the overall cumulative loss across different tasks at the adapted parameters is small as well. This is the meta-loss function:

\[M(\phi) = \sum_{j=1}^J L(f_{\phi_j},T_j) .\]

Notice the interesting nested structure:

The meta-loss function $M(\phi)$ depends on the adapted weights $\phi_j$
which in turn depend on the base weights $\phi$ via one or more steps of gradient descent.

So we can update the base weights themselves by summing up the gradients computed during the adaptation:

\[\begin{aligned} \phi &\leftarrow \phi - \beta \nabla_\phi M(\phi) \\ &= \phi - \beta \sum_j \nabla_\phi L(f_{\phi_j},T_j) \\ &= \phi - \beta \sum_j \nabla_\phi L(f_{\phi - \alpha \nabla_\phi L(f_\phi, T_j)}, T_j) . \end{aligned}\]
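The nested update can be traced by hand on a toy problem. The sketch below assumes scalar weights and hypothetical quadratic task losses $L(f_\phi, T_j) = \frac{1}{2}(\phi - c_j)^2$, so that every gradient is available in closed form; the task centers `centers` and the step sizes are made up for illustration.

```python
def loss(phi, c):
    # Toy task loss: L(f_phi, T_j) = 0.5 * (phi - c_j)^2
    return 0.5 * (phi - c) ** 2

def grad(phi, c):
    # Derivative of the toy task loss with respect to phi.
    return phi - c

def adapt(phi, c, alpha):
    # Inner loop: one step of gradient descent from the base weights.
    return phi - alpha * grad(phi, c)

def meta_loss(phi, centers, alpha):
    # M(phi): sum of task losses evaluated at the adapted weights.
    return sum(loss(adapt(phi, c, alpha), c) for c in centers)

def meta_grad(phi, centers, alpha):
    # Chain rule: dL(phi_j)/dphi = L'(phi_j) * dphi_j/dphi,
    # and for this quadratic loss dphi_j/dphi = 1 - alpha.
    return sum(grad(adapt(phi, c, alpha), c) * (1.0 - alpha) for c in centers)

centers = [-1.0, 0.5, 3.5]   # hypothetical tasks
alpha, beta = 0.1, 0.1
phi = 10.0
for _ in range(200):
    # Outer loop: update the base weights with the meta-gradient.
    phi -= beta * meta_grad(phi, centers, alpha)
```

For this particular family of tasks the base weights settle at the mean of the task centers, which is the starting point from which one inner step does the most good on average.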

Some further observations:

the “samples” in the above update correspond to different tasks. One could use stochastic methods here to speed things up: the meta-learner samples new learning agents, “teaches” them how to update their weights by giving them the base model, and “learns” a new set of base model weights.
“generalization” here corresponds to the fact that after a while, MAML learns parameters that can be adapted to new, unseen tasks via fine-tuning.
the above equation is specific to the learning agents in MAML using one step of gradient descent. But one could use any other optimization method here – $k$-steps of gradient descent, SGD, Adam, Hessian methods, whatever – call this method $\text{Alg}$. Then a general form of MAML is:

\[\phi \leftarrow \phi - \beta \sum_j \nabla L(\text{Alg}_j)\]

The only requirement is that there is some way to take the derivative of $\text{Alg}$ in the chain rule – i.e., MAML works by taking the gradient of gradient descent!

One last point: the above gradient updates in MAML can be quite expensive. In particular, the meta-gradient update requires a gradient-of-a-gradient (due to the chain rule), i.e., second derivatives of the loss, and already needs a lot of computation. If we increase this to $k$ steps of gradient descent, then we need to backpropagate through all $k$ steps, with second-derivative terms appearing at each one. A series of algorithmic improvements has reduced this computational dependence on the complexity of the optimizer, but we won’t cover them here.
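For the one-step case, the gradient-of-a-gradient can be written out explicitly. Assuming the loss is twice differentiable, applying the chain rule to $\phi_j = \phi - \alpha \nabla_\phi L(f_\phi, T_j)$ gives

\[\nabla_\phi L(f_{\phi_j}, T_j) = \left( I - \alpha \nabla^2_\phi L(f_\phi, T_j) \right) \nabla_{\phi_j} L(f_{\phi_j}, T_j) ,\]

so each task contributes one Hessian-vector product to the meta-gradient.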

Metric embeddings

An alternative family of meta-learning approaches is learning metric embeddings. The high level idea is to learn embeddings (or latent representations) of all data points in a given dataset (similar to how we learned word embeddings in NLP). If the embeddings are meaningful, then the geometry of the embedding space should tell us class information (and we should be able to use simple geometric methods such as nearest neighbors or perceptrons to classify points).

An early approach (pioneered by LeCun and collaborators in the nineties and revived a few years ago) is Siamese Networks. The goal was to solve one-shot image classification tasks, where we are given a database of exactly one image in each class.

Imagine a training dataset $(x_1, x_2, \ldots, x_n)$. The label indices don’t matter here since all the points are of distinct classes. Siamese nets work as follows.

set up a Siamese network (pair of identical, weight-tied feedforward convnets, followed by a second network). The first part (pair of identical networks) $f_\theta$ consists of a standard convnet mapping data points to some latent set of features; we use this to evaluate every pair of data points and get outputs $f_\theta(x_i)$ and $f_\theta(x_j)$.
compute the coordinate-wise distances

\[g(x_i, x_j) = |f_\theta(x_i) - f_\theta(x_j) | .\]

This gives a vector which is a measure of similarity between the embeddings.

Feed it through a second network that gives probabilities of matches, i.e., whether the two images are from the same class. A simple such network would be:

\[p(x_i, x_j) = \sigma(W g(x_i, x_j))\]

Apply standard data augmentation techniques (noise, distortion, etc) and train the network using SGD.
Given a test image, match it with every point in the dataset. The final predicted class is the one with the max matching probability.

\[c(x) = \arg \max_{i \in S} p(x,x_i) .\]
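The matching pipeline above can be sketched in a few lines. This is only an illustration under strong assumptions: the encoder `embed` is a single tanh layer standing in for the convnet $f_\theta$, and the weights `theta` and `w` are hand-picked rather than trained.

```python
import math

def embed(x, theta):
    # Toy stand-in for the convnet f_theta: one linear layer followed by tanh.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in theta]

def match_probability(x_i, x_j, theta, w):
    # g(x_i, x_j) = |f_theta(x_i) - f_theta(x_j)|, computed coordinate-wise.
    g = [abs(a - b) for a, b in zip(embed(x_i, theta), embed(x_j, theta))]
    # p(x_i, x_j) = sigma(W g); here W is a single row of weights.
    score = sum(wk * gk for wk, gk in zip(w, g))
    return 1.0 / (1.0 + math.exp(-score))

def classify(x, support, theta, w):
    # c(x) = argmax_i p(x, x_i) over the one-shot database.
    probs = [match_probability(x, x_i, theta, w) for x_i in support]
    return max(range(len(support)), key=lambda i: probs[i])

theta = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical encoder weights
w = [-1.0, -1.0]                  # negative weights: larger distance, lower match probability
support = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # one example per class
predicted = classify([0.9, 1.1], support, theta, w)
```

Since the formula has no bias term, two identical inputs receive probability exactly 0.5; only the ranking of the probabilities matters for the final arg max.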

This idea was refined to the $k$-shot case via Matching Networks by Vinyals and coauthors in 2016. The steps are similar to the ones above, except that we don’t compute distances in the middle, and use a trainable attention mechanism $a$ (instead of a standard MLP) to declare the final class. Writing $x_{ij}$ for the $j$-th example of class $i$ and $y_{ij}$ for its one-hot label:

\[c(x) = \arg \max_{c} \left[ \sum_{i=1}^n \sum_{j = 1}^k a(x_{ij}, x) \, y_{ij} \right]_c .\]
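A minimal sketch of this attention-weighted vote follows. It assumes the attention is a softmax over cosine similarities and, to stay short, applies it to the raw inputs rather than to learned embeddings; the support set is made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention(x, support_x):
    # Softmax over similarities between the query and every support example.
    scores = [math.exp(cosine(x, s)) for s in support_x]
    total = sum(scores)
    return [s / total for s in scores]

def predict(x, support_x, support_y, n_classes):
    # Attention-weighted sum of one-hot labels, followed by an arg max.
    votes = [0.0] * n_classes
    for a, y in zip(attention(x, support_x), support_y):
        votes[y] += a
    return max(range(n_classes), key=lambda c: votes[c]), votes

# Hypothetical 2-class 2-shot support set.
support_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_y = [0, 0, 1, 1]
label, votes = predict([0.8, 0.2], support_x, support_y, n_classes=2)
```

The entries of `votes` sum to one, so they can be read as a predicted distribution over the classes in the support set.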

Other attempts along this line of work include:

Triplet networks (where we use three identical networks and train with triples of samples $(x', x'', x^{-})$ – the first two from the same class and the last from a different class).
Prototypical Networks
Relation Networks

among several others.

Contrastive self-supervision

This idea of using Siamese networks to learn useful embedding features for unlabeled/few-shot datasets is rather similar to the next-sentence-prediction task that we used to learn BERT-style representations in NLP.

We can use similar techniques for other data types too! For example, imagine that we were trying to learn embeddings for image- or video- data. The Siamese network idea works here too – as long as we develop a contrastive pretext task that enables us to devise embeddings and compare pairs (or triples) of inputs. The above example of Siamese networks corresponded to a “same-class-classification” pretext task. But we could think of others:

for images, one candidate pretext task could be to predict relative transformations: given two images, predict whether one is a rotation/crop/color transformation of the second, or not.
for video, one candidate pretext task could be shuffle-and-learn where, given three shuffled frames, the goal is to restore their temporally coherent order.
for audio-video, a candidate pretext task could be to predict whether the given audio corresponds to the video, or not.
Jigsaw puzzles: the input is a bunch of shuffled tiles, and the goal is to predict a permutation.

All these methods have been applied with varying degrees of success, culminating in SimCLR (by Hinton and co-authors) which reached AlexNet-level performance on image classification using 100X fewer labeled samples.

The idea in SimCLR is surprisingly simple: given an image $x$, a good model must be able to distinguish “positive” examples (e.g. all natural geometric transformations of this image, say $T(x)$), from “negative” ones (e.g. other images from a minibatch not related to this one).

The way it implements this is by learning features that optimize a contrastive loss. Given an image $x$, a positive example $x_+$ sampled from $T(x)$, and a set of negative examples $N$, SimCLR does the following:

Apply an encoder $g$ to all data points. Think of this encoder as (say) a standard ResNet.
Apply a small “projection head” $h$ to map the encoder features to a space where the loss can be applied. This can be, for example, a shallow MLP similar to what we used for Siamese networks above.
Let $z = h(g(x)), z_+ = h(g(x_+)), z_n = h(g(x_n))$. Take a gradient step that optimizes a cross-entropy style contrastive loss; here, $\tau$ is a temperature parameter and $\text{sim}$ is the cosine similarity between two vectors.

\[l = - \log \frac{\exp\left(\text{sim}(z,z_+)/\tau\right)}{\sum_{n \in N \cup \{z_+\}} \exp\left(\text{sim}(z,z_n)/\tau\right)}\]

Why does this loss make sense? Intuitively, starting from a totally unlabeled dataset we are setting up a classification problem where there are as many “classes” as samples in a minibatch. Therefore, a good feature embedding $g$ should learn to sufficiently separate out the different “classes”, i.e., group features corresponding to the same root image together while pushing the rest far apart. In this sense, SimCLR can be viewed as a considerable simplification of the Siamese Net idea.
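The loss for a single anchor can be computed directly from the projected features. The sketch below assumes the vectors $z$, $z_+$, $z_n$ have already been produced by the encoder and projection head; the value `tau=0.5` and the example vectors are arbitrary choices for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(z, z_pos, z_negs, tau=0.5):
    # Numerator: similarity to the positive example.
    pos = math.exp(cosine(z, z_pos) / tau)
    # Denominator: the positive plus all negatives.
    neg = sum(math.exp(cosine(z, zn) / tau) for zn in z_negs)
    return -math.log(pos / (pos + neg))

# Positive close to the anchor, negatives far away: small loss.
good = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, a negative close to it: large loss.
bad = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

When the positive and a single negative are equally similar to the anchor, the loss equals $\log 2$, the cross-entropy of a coin flip between two "classes".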

Once feature embeddings have been learned, we can drop the projection head, and just use the feature encoder $g$. For any new downstream task (even with a small number of examples), we can then throw a linear classifier on top and either freeze the encoder weights/learn only the top layer, or fine-tune all the weights, depending on how much data is present. Here is a visualization of comparisons with other self-supervised baselines:

For a nice illustrated summary, see here.

Contrastive Language-Image Pretraining

Simple ideas that work (like SIMCLR) usually lead to good things. One (surprisingly powerful) offshoot is language-image feature learning.

Say we have a large, unstructured dataset of captioned images. This dataset can be acquired (say) by doing a search of Flickr or Instagram or something else.

Using this dataset, we can use a SIMCLR-style approach to learn a multi-modal feature encoder, having one tower of weights for the language part (call it $g_l()$), and one tower of weights (call it $g_v()$) for the image part. These weights should satisfy two properties:

features learned by applying the image tower to an image, and the language tower to the corresponding caption, should be similar;
image features should be far away from the caption features of other images in the minibatch.

So, basically SIMCLR, except that the “transformation” backbone is removed, and replaced by one that processes a natural language caption! This approach is called CLIP (Contrastive Language-Image Pretraining), and this model serves as the feature extraction bedrock of more exciting, subsequent developments in unsupervised generative models (such as Dall-E 2 and Stable Diffusion). See the [CLIP paper] for details; this figure is a great illustration.
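The two properties can be turned into a loss over a minibatch of paired features. The sketch below assumes the tower outputs $g_v(\cdot)$ and $g_l(\cdot)$ are already computed, and uses a cross-entropy over the similarity matrix applied in both directions; the temperature `tau=0.07` and the toy features are assumptions for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def log_softmax(row):
    # Numerically stable log-softmax of a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_feats, text_feats, tau=0.07):
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    n = len(img)
    # logits[i][j]: scaled cosine similarity of image i and caption j.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / tau
               for j in range(n)] for i in range(n)]
    # Image-to-text: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[2.0, 0.0], [0.0, 3.0]])   # captions match images
swapped = clip_loss(images, [[0.0, 3.0], [2.0, 0.0]])   # captions mismatched
```

The loss is near zero when each image is closest to its own caption and large when the pairing is scrambled.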

A few more CLIP-isms:

Just as in SimCLR, we can do transfer learning by taking the image feature tower, throwing a linear layer on top, and finetuning to a new (given, small-size) dataset.
A beautiful benefit of CLIP is the ability to do zero-shot transfer. Say we wanted to build an animal classifier, but our training dataset had zero images of (say) a woolly mammoth. This would be an entirely new class label (outside the set of known concepts), and in the standard supervised setup, there would be no way of recognizing this new concept.

However, CLIP circumvents this as follows. To recognize the woolly mammoth, all we would have to provide is a caption in natural language describing the picture of a mammoth, something like “a photo of an animal that looks like an elephant but has brown fur and big tusks”.

Why would this work? Notice that we not only learned good image features via CLIP; we also aligned the features with text descriptions. Therefore, generalization to new, totally unseen categories can effectively happen, if somehow we were able to map the new category to a string of already-seen language tokens.

Technically speaking, this is not a fully unsupervised approach (there is the issue of coming up with a suitable caption, or prompt), so this method can be viewed as weak language supervision.
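The zero-shot recipe amounts to a nearest-neighbor lookup between one image feature and the features of candidate captions. In the sketch below the feature vectors are made up by hand; in practice they would come from the trained image and language towers.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_feat, prompt_feats):
    # Score every candidate caption against the image and keep the best.
    scores = {p: cosine(image_feat, f) for p, f in prompt_feats.items()}
    return max(scores, key=scores.get), scores

MAMMOTH = "a photo of an animal that looks like an elephant but has brown fur and big tusks"
# Hypothetical language-tower outputs for three prompts.
prompt_feats = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    MAMMOTH: [0.0, 0.2, 0.9],
}
# Hypothetical image-tower output for a picture of a mammoth.
best, scores = zero_shot_classify([0.1, 0.1, 0.8], prompt_feats)
```

No classifier weights are trained for the new class; adding a category only requires adding one more prompt to the dictionary.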

Generative Pre-Training

Under construction.

Lecture 11: Applications of Deep RL

2021-11-21T00:00:00+00:00

In which we discuss success stories of deep RL, and the road ahead.

AlphaGo

A major landmark in deep learning research was the demonstration of AlphaGo in 2015, which was one of the success stories of deep RL in real(istic) applications.

Go is a two-player board game where the players take turns placing black and white “stones” on a 19x19 grid, and the goal is to surround the opponent’s pieces and “capture” territory. The winner is declared by counting each player’s surrounded territory.

The classical way to solve such two-player games (and others like Chess) via AI is to search a game tree, where each node in the tree is a game state (or snapshot) and child nodes are results of possible actions taken by each player. The leaves of the tree denote end states, and the goal of the AI is to discover paths to valuable/winning leaves while avoiding bad paths. Leaving aside the definition of “value”, this is obviously a very large tree in both Chess and Go, with the number of leaves being exponential in the depth (i.e., the number of moves in the game).

[An aside: in Chess after sufficiently many moves there is a particular phase called the Endgame, after which winning sequences of moves are more or less well understood, and can be hard coded. Computer chess heavily relied on this particular trick; unfortunately, endgames in Go are way more complicated, and solving Go via computer was viewed as a major bottleneck.]

One way to reduce the number of possible paths is to perform Monte Carlo Tree Search, which was a crude form of estimating the Value function $V(s)$ of each state (i.e., each node in the tree) via random search.
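The rollout idea can be illustrated on a game much smaller than Go. The sketch below uses a hypothetical take-away game (players alternately remove 1 to 3 stones and whoever takes the last stone wins) and estimates $V(s)$ as the fraction of uniformly random playouts won by the player to move; it covers only the random-rollout estimate, not the tree-building part of Monte Carlo Tree Search.

```python
import random

def rollout(stones, rng):
    # Play uniformly random moves until the pile is empty.
    # Returns 1 if the player who moves first takes the last stone, else 0.
    mover = 0
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return 1 if mover == 0 else 0
        mover = 1 - mover

def estimate_value(stones, n_rollouts, rng):
    # Monte Carlo estimate of V(s): average outcome over random playouts.
    wins = sum(rollout(stones, rng) for _ in range(n_rollouts))
    return wins / n_rollouts

rng = random.Random(0)
v_one = estimate_value(1, 100, rng)     # forced win: take the last stone
v_two = estimate_value(2, 20000, rng)   # random play wins about half the time
```

The estimate is crude because both sides play at random; with two stones an optimal player always wins by taking both, yet the rollout value is only about one half.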

The beauty of DeepMind’s AlphaGo (which was introduced in 2016) is that it completely eschews a tree-based data structure for representing the game. Instead, the state of the game is represented by a 19x19 black/white/gray image, which is fed into a deep neural network – just like how we would classify an MNIST grayscale image. The output of the network is the instantaneous policy, i.e., distribution over possible next moves. The architecture is a vanilla 13-layer convnet.

In fact, just this network is enough to do well in Go. One can train this in a standard supervised learning manner using an existing database of game-state/next-move pairs, and beat computer Go players based on tree search nearly 99% of the time! But top human players were able to beat this model.

But AlphaGo leverages the fact that we can do even better with RL. We can update the above network using self-play, where we create new games by sampling rollouts using the predicted distribution, measure rewards at the end of the game, and use the REINFORCE algorithm for further updating the weights.

In addition to the policy network trained above, AlphaGo also constructs a second network (called the value network) which, for a given state, predicts which player has the advantage. In some sense, one can view this as analogous to how we motivated GANs: the policy network proposes actions to take, and the value network evaluates how good different actions are in terms of expected return. [Such an approach is called an actor-critic method, which we discuss below.] There were other additional hacks thrown on top to make everything work, but this is the gist. Read the (very enjoyable) paper if you would like to learn more.

Actor-critic methods

The high level idea in actor-critic methods is to combine elements from both policy gradients as well as Q-learning. Recall that the key idea in policy gradients was the computation of the update rule using the log-derivative trick:

\[\frac{\partial}{\partial \theta} \mathbb{E}_{\pi(\tau)} R(\tau) \approx R(s,a) \frac{\partial}{\partial \theta} \log \pi_\theta(a | s).\]

Here, for simplicity we have ignored trajectories and assumed that policies only depend on the current state of the world, and rewards only depend on the current action that we take. The policy network $\pi_\theta$ outputs a distribution over actions; favorable actions are associated with higher probabilities. We call this the actor network.
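The simplified update rule can be run end to end on a tiny problem. The sketch below assumes a single state with three actions, a softmax policy over learnable preferences `theta`, and a hypothetical reward table in which only the last action pays off.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, action):
    # For a softmax policy: d log pi(action) / d theta_i = 1[i == action] - pi_i
    pi = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def reinforce(rewards, steps, lr, rng):
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        # Sample an action from the current policy.
        pi = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=pi)[0]
        # Ascend R(s, a) * grad log pi(a | s).
        g = grad_log_pi(theta, action)
        theta = [t + lr * rewards[action] * gi for t, gi in zip(theta, g)]
    return theta

theta = reinforce(rewards=[0.0, 0.0, 1.0], steps=5000, lr=0.1, rng=random.Random(0))
pi = softmax(theta)
```

After training, almost all of the probability mass in `pi` sits on the rewarded action.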

Instead of using the reward $R$ directly (which could be sparse, or non-informative), we instead replace this via the expected discounted reward, which is essentially the Q-function $Q(s,a)$. But where does this value come from? To compute this, we use a second auxiliary neural network (call it $Q_\phi$ where $\phi$ denotes this auxiliary network’s weights). We call this the critic network.

This sets up an interesting game-theoretic interpretation. The actor learns to play the game, and picks the best moves at each time step. The critic learns to estimate values of different actions by the actor, and keeps track of long-term future rewards. There are other concepts involved here (such as advantage functions and two-time-scale learning) which we won’t get into; these are best left for a detailed course on RL.

The overall algorithm proceeds as follows:

Initialize $\theta, \phi, s$
At each time step:

a. Sample $a' \sim \pi_\theta(a' | s_t)$

b. Update actor network weights $\theta$ according to log-derivative trick

c. Compute Bellman error $\delta_t$

d. Update critic network weights $\phi$ according to Q-learning updates.

e. Update state to $s_{t+1}$ and repeat!
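The loop above can be sketched with tables in place of neural networks. Everything specific here is an assumption made for illustration: the two-state toy environment in `step`, the tabular critic `q`, the discount `GAMMA`, and the use of the sampled next action in the Bellman error (a SARSA-style target rather than a max over actions).

```python
import math
import random

GAMMA = 0.9
TERMINAL = None

def step(state, action):
    # Hypothetical chain: action 0 ends the episode with no reward;
    # action 1 moves 0 -> 1, and from state 1 ends with reward 1.
    if action == 0:
        return 0.0, TERMINAL
    if state == 0:
        return 0.0, 1
    return 1.0, TERMINAL

def policy(theta, state):
    # Softmax over the action preferences of this state.
    m = max(theta[state])
    exps = [math.exp(t - m) for t in theta[state]]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(theta, state, rng):
    return rng.choices([0, 1], weights=policy(theta, state))[0]

def actor_critic(episodes, lr_actor, lr_critic, rng):
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: preferences per state
    q = [[0.0, 0.0], [0.0, 0.0]]      # critic: tabular Q(s, a)
    for _ in range(episodes):
        s = 0
        a = sample_action(theta, s, rng)
        while s is not TERMINAL:
            reward, s_next = step(s, a)
            # Actor update: log-derivative trick weighted by the critic.
            pi = policy(theta, s)
            for i in range(2):
                grad = (1.0 if i == a else 0.0) - pi[i]
                theta[s][i] += lr_actor * q[s][a] * grad
            # Bellman error for the critic.
            if s_next is TERMINAL:
                a_next, target = None, reward
            else:
                a_next = sample_action(theta, s_next, rng)
                target = reward + GAMMA * q[s_next][a_next]
            delta = target - q[s][a]
            # Critic update, then move to the next state.
            q[s][a] += lr_critic * delta
            s, a = s_next, a_next
    return theta, q

theta, q = actor_critic(episodes=5000, lr_actor=0.1, lr_critic=0.1, rng=random.Random(0))
```

In this toy problem the critic's estimate for the rewarded move approaches 1, and the actor learns to prefer action 1 in both states.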

AlphaFold2

Lecture 11: Generative Adversarial Networks

2021-04-12T00:00:00+00:00

In which we introduce the concept of generative models and two common instances encountered in deep learning.

Much of what we have discussed in the first part of this course has been in the context of making deterministic, point predictions: given image, predict cat vs dog; given sequence of words, predict next word; given image, locate all balloons; given a piece of music, classify it; etc. By now you should be quite clear (and confident) in your ability to solve such tasks using deep learning (given, of course, the usual caveats on dataset size, quality, loss function, etc etc).

All of the above tasks have a well defined answer to whatever question we are asking, and deep networks trained with suitable supervision can find them. But modern deep networks can be used for several other interesting tasks that conceivably fall into the purview of “artificial intelligence”. For example, think about the following tasks (that humans can do quite well), that do not cleanly fit into the supervised learning framework:

find underlying laws/characteristics that are salient in a given corpus of data.
given a topic/keyword (say “water lily”), draw/synthesize a new painting (or 250 paintings, all different) based on the keyword.
given a photograph of a face (with the left half blacked out), mentally visualize what the rest would look like.
be able to quickly adapt to new tasks.
be able to memorize and recall objects.
be able to plan ahead in the face of uncertain and changing environments;

among many others.

In the latter part of the course we will focus on solving such tasks. Somewhat fortunately, the main ingredients of deep learning (feedforward/recurrent architectures, gradient descent/backpropagation, data representations) will remain the same – but we will put them together into novel formulations.

GANs

We will now discuss families of generative models that are able to accurately reproduce very “realistic” data, even in high dimensions (such as high resolution face images).

Certain tasks in ML have well defined objective functions. (For example, classification; the obvious metric here is the 0/1 loss, and cross-entropy is its natural continuous relaxation.)

Certain tasks don’t have a well-defined objective function. For example, if we ask a neural net to “draw a painting”, the loss function is not well-defined.

However, we can provide examples of paintings and hope to reproduce more of those. Mathematically, if there is a sub-manifold of all image data that correspond to paintings, we can think of it as a distribution, learn its parameters, and then sample from it. (This is roughly the philosophy we used last time, but note that we are not necessarily assigning likelihoods here.)

Let us use a different approach this time, and work backwards. Let’s say our generative model (which is a neural network) was able to generate a sample painting. Let’s say an oracle (or human) is available, who can eyeball the painting and returns YES if the sample painting is realistic enough, and NO if not. This piece of information can be viewed as a rough error signal — and if there was some way to “backpropagate” this error, we can use gradient descent to iteratively adjust the parameters of the network, and generate more and more samples until the sample output always passes the eye test.

Sounds like a good idea, except, having an actual human to check each sample is not feasible.

To resolve this issue, let us now replace the oracle with a second neural network. We call this the discriminator or the critic, which, in principle, should be able to tell the difference between “real” data samples, obtained from nature, and “synthetic” data samples produced by the generator.

But this discriminator network itself needs to be trained in order to learn to distinguish between real and fake samples. The insight used in GANs is a clever bootstrapping technique, where the samples from the generator serve as the fake data samples and are compared with a training dataset of real samples.

Moreover, the bootstrapping technique enables us to iteratively improve both the generator and the discriminator. In the beginning, the discriminator does its job easily: the generator produces noise, and the discriminator quickly learns to figure out real vs fake. As training progresses, the generator begins to catch up, and the discriminator needs to adjust its parameters to keep up. In this way, GAN training can be viewed as a two-player game, where the goal of Player 1 (the generator) is to fool the discriminator, and the goal of Player 2 (the discriminator) is to not be fooled by the generator. This is called adversarial training, and hence the name “GAN”.

Somewhat interestingly, this type of learning procedure seems to achieve state-of-the-art generative modeling results. The results are impressive: can you figure out which of these dog images are fake and which are real?

Mathematics of GANs

Let us now cast the above discussion into a typical 3-step ML framework (representations, objective function, and optimization algorithm.)

We denote $G_\Theta(\cdot)$ to be the generator. Here, $\Theta$ represents all the weights/biases of the generator network. As mentioned above, unlike regular neural networks used for classification/regression, the network architecture is “reversed” – it takes in as input a low-dimensional latent code vector $z$, and produces a high dimensional data vector (such as an image) as output. Recall that in a regular network, dimensionality is successively reduced through the layers (via pooling/striding); in a GAN generative network, dimensionality is successively expanded via upsampling or dilated/transpose convolutions.

We denote $D_\Psi(\cdot)$ to be the discriminator. This is a regular feedforward or convnet architecture, and produces an output probability of an input data sample being real or fake.

Let $y$ be the label where $y=1$ denotes real data and $y=0$ denotes fake data. For a given input, we will train the discriminator to minimize the cross-entropy loss:

\[L(\Psi) = - y \log D_\Psi(x) - (1-y) \log (1 - D_\Psi(x))\]

The first term disappears if $x$ is fake ($y=0$), and the second term disappears if $x$ is real ($y=1$). Fake data samples can be produced by sampling $z \sim \text{Normal}(0,I)$ and passing it through the generator network to produce $G_\Theta(z)$. So the loss function now becomes:

\[L(\Theta,\Psi) = - E_{x \sim \text{real}} \log D_\Psi(x) - E_{z \sim \text{Normal}(0,I)} \log (1 - D_\Psi(G_\Theta(z))) ,\]

where now the goal of the generator is to fool the discriminator as much as possible (i.e., maximize $L$). So the two-player game now becomes:

\[\max_\Theta \min_\Psi L(\Theta,\Psi) .\]

In the literature, it is conventional to flip min- and max-, and negate the loss function. So the standard GAN objective now becomes:

\[L(\Theta,\Psi) = E_{x \sim \text{real}} \log D_\Psi(x) + E_{z \sim \text{Normal}(0,I)} \log (1 - D_\Psi(G_\Theta(z)))\]
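To get a feel for the values this objective takes, the sketch below evaluates a minibatch estimate of it from discriminator outputs alone. The probability values are made up; `d_real` stands for $D_\Psi(x)$ on real samples and `d_fake` for $D_\Psi(G_\Theta(z))$ on generated ones.

```python
import math

def gan_objective(d_real, d_fake):
    # Minibatch estimate of E log D(x) + E log(1 - D(G(z))).
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake almost perfectly.
confident = gan_objective(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A discriminator that has been fooled and outputs 1/2 everywhere.
fooled = gan_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The objective is close to 0 (its largest possible value) for the confident discriminator and equals $-2 \log 2$ for the fooled one, which is why the discriminator pushes it up and the generator pushes it down.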

We now discuss how to train this network. In each iteration, we sample two minibatches: one of real data samples and one of fake data samples. Then, we form the above objective function and take gradients. The gradient with respect to the generator weights is used to update $\Theta$, and the gradient with respect to the discriminator weights is used to update $\Psi$ (note that since we are minimizing with respect to $\Theta$ and maximizing with respect to $\Psi$, this is an algorithm called gradient descent-ascent):

\[\begin{aligned} \Theta &\leftarrow \Theta - \eta \nabla_\Theta L(\Theta,\Psi) \\ \Psi &\leftarrow \Psi + \eta \nabla_\Psi L(\Theta,\Psi) \end{aligned}\]

In practice, other updates (such as Adam) may be used.
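The descent-ascent updates can be tried on a toy saddle problem that has nothing to do with images. The objective $L(\Theta, \Psi) = \frac{1}{2}\Theta^2 + \Theta\Psi - \frac{1}{2}\Psi^2$ with scalar $\Theta, \Psi$ is an assumption chosen because its gradients are available in closed form and its saddle point is at the origin.

```python
def grad_theta(theta, psi):
    # dL/dtheta for L = 0.5 * theta^2 + theta * psi - 0.5 * psi^2
    return theta + psi

def grad_psi(theta, psi):
    # dL/dpsi for the same toy objective
    return theta - psi

def descent_ascent(theta, psi, eta, steps):
    for _ in range(steps):
        # Compute both gradients at the current point (simultaneous update).
        g_theta = grad_theta(theta, psi)
        g_psi = grad_psi(theta, psi)
        theta = theta - eta * g_theta   # descent on the minimizing player
        psi = psi + eta * g_psi         # ascent on the maximizing player
    return theta, psi

theta, psi = descent_ascent(theta=1.0, psi=1.0, eta=0.1, steps=200)
```

On this well-behaved objective the iterates spiral into the saddle point; actual GAN objectives are not convex-concave, which is one reason their training is much less predictable.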

Note that due to all the hacks above, we cannot quite calculate likelihoods the way we do in the case of flow-models. For this reason, GANs are instances of likelihood-free generative models.

Challenges, extensions, and examples

There are a couple of issues with GAN training that we need to keep in mind.

One issue is the form of the loss itself. Observe above that the generator weights only get updated by the gradients of the second term:

\[\log (1 - D_\Psi(G_\Theta(z)))\]

since they do not appear in the first. The problem with this is that if the generator sample is really bad (as is typically the case in the beginning of training), then the discriminator’s prediction is close to zero, and since $\log (1- D)$ is very flat when $D \approx 0$ there is not enough ‘signal’ to move the generator weights meaningfully. Increasing the learning rate does not seem to help. This is called the saturation problem in GANs.

To fix this, while updating generator weights, it is common to heuristically replace the second term in the GAN loss with:

\[- \log D_\Psi(G_\Theta(z))\]
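The difference between the two generator losses shows up clearly in their gradients. The sketch below assumes the discriminator output is a sigmoid of a pre-activation logit $a$, i.e. $D = \sigma(a)$, and differentiates each loss with respect to $a$; the logit value $-6$ is an arbitrary example of a confidently rejected fake sample.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log sigmoid(a)] = sigmoid(a) - 1
    return sigmoid(a) - 1.0

a = -6.0  # discriminator is confident the sample is fake: D is about 0.0025
g_sat = grad_saturating(a)
g_non = grad_non_saturating(a)
```

At this point the original loss has a gradient of magnitude about 0.0025, while the replacement has a gradient of magnitude close to 1.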

A comparison of the two losses is shown below. This solves the saturation problem, but note that now the gradients close to zero are suddenly very high and training becomes unstable. Stably training GANs was a challenge faced by the community for quite some time (and continues to be a challenge), and a common resolution is to use Wasserstein GANs. We won’t get into the details here, but the high level idea is that the above GAN loss function can be viewed as a specific form of distance between probability distributions (called the Jensen-Shannon divergence), and this can be generalized to other distances. A common alternative is the Earth-mover or Wasserstein distance, leading to a different type of GAN model called Wasserstein GAN. There is a lengthy derivation involved, but the loss function becomes:

\[L^{\text{WGAN}}(\Theta,\Psi) = E_{x \sim \text{real}} f(D_\Psi(x)) - E_{z \sim \text{Normal}(0,I)} f(D_\Psi(G_\Theta(z))) ,\]

where $f$ is a monotonic function that is 1-Lipschitz. In practice, this property can be enforced via a procedure called weight clipping (or, in later variants, a gradient penalty), but let’s not get into the weeds.

A third issue is something called mode collapse. To see how this can happen, suppose that the network $G_\Theta(z)$ is accidentally trained such that it always produces a fixed output $\hat{x}$ no matter what the $z$ is (i.e., the range of $G$ collapses to a single point), and that the output $\hat{x}$ exactly matches a sample from the real dataset. This leads to zero loss, and hence is an optimal solution! So in some sense, the network has memorized exactly one data sample from the training dataset – so it has not really learned the distribution – but the GAN loss function does not really distinguish between the two regimes.

This is actually not an isolated occurrence. Even if the generator does not memorize a given data point, it could just memorize a set of weights to produce fake data points that somehow the discriminator does not do very well on. This is a consequence of the two-player game; the generator can “win” by finding a “cheat code” set of weights that is over-optimized to fool the particular discriminator, and not necessarily actually solving the game (of learning the probability distribution).

Mode collapse can be viewed as a specific form of overfitting, and there are a few ways to avoid this: early stopping helps; so does changing the objective function to encourage diversity in mini-batches; and so does adding noise to the discriminator/generator outputs (a la dropout).

There are many more tricks to get GANs working (and we won’t get into all of them here), but here are some representative images.


Conditional GANs

The above types of GAN models enable sampling from the data distribution: choose a random new latent code vector $z$ and generate a new sample $x = G(z)$.

In practice, however, it would be nice to have some kind of user control over the outputs. For example, the following applications:

Category-dependent generation
Image style transfer

Simple example: class-conditional GAN. Say MNIST digit. This is easy; we just augment the input $z$ with the class label $c$, and feed the same to the discriminator. So in some sense, a subset of features in the code vector fed to the generator are clearly interpretable as categorical input codes.

\[L(\Theta,\Psi) = E_{x \sim \text{real}} \log D_\Psi(x | c) + E_{z \sim \text{Normal}(0,I)} \log (1 - D_\Psi(G_\Theta(z | c)))\]
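To make the "augment the input $z$ with the class label $c$" step concrete, here is a minimal sketch. It assumes a one-hot encoding of the label and simple concatenation; the notes do not fix either choice, and the sizes below are purely illustrative.

```python
import random

def one_hot(label, num_classes):
    """Return a one-hot list encoding of an integer class label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def conditional_input(z, label, num_classes):
    """Augment the latent code z with the one-hot class label c (assumed encoding)."""
    return list(z) + one_hot(label, num_classes)

rng = random.Random(0)
latent_dim, num_classes = 4, 10   # illustrative sizes, not from the notes
z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

# The generator would consume this vector; the discriminator would receive
# the same label alongside the (real or generated) image.
g_in = conditional_input(z, label=7, num_classes=num_classes)
print(len(g_in))   # latent_dim + num_classes entries
```

The last `num_classes` entries of the generator input are exactly the interpretable categorical codes mentioned above.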

A harder problem is image style transfer. Say we want the content to remain the same but change the weather, or change night to day, or change artistic style. The issue with this kind of problem is that labels are hard to find (how do we get pairs of images with same content but different style?)

A way to achieve this is called cycle consistency. At a high level, the generative model consists of three networks simultaneously trained:

Train two generative nets: $G_1$ for Style 1 to Style 2, and $G_2$ for Style 2 back to Style 1.
Use a discriminator to ensure that samples from $G_1$ (Style 2) are indistinguishable from real data.
Use a reconstruction loss to make sure that $G_2$ learns to invert $G_1$.
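The reconstruction term in the last step can be sketched as follows. This is only a toy illustration: `g1`, `g2`, and `g2_bad` are hypothetical stand-ins for trained networks (simple fixed functions here), and the squared error is one possible choice of reconstruction loss.

```python
def g1(x):
    """Hypothetical Style 1 -> Style 2 map (stand-in for a trained network)."""
    return [2.0 * v + 1.0 for v in x]

def g2(y):
    """Hypothetical Style 2 -> Style 1 map; here the exact inverse of g1."""
    return [(v - 1.0) / 2.0 for v in y]

def g2_bad(y):
    """A map that does not invert g1, for comparison."""
    return [v / 2.0 for v in y]

def cycle_loss(x, forward, backward):
    """Squared reconstruction error ||x - backward(forward(x))||^2."""
    x_rec = backward(forward(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

x = [0.5, -1.0, 2.0]
print(cycle_loss(x, g1, g2))       # zero: g2 inverts g1
print(cycle_loss(x, g1, g2_bad))   # positive: the cycle is not consistent
```

Minimizing this term pushes $G_2$ towards inverting $G_1$, without ever needing paired images of the same content in both styles.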

Examples:

Variational Autoencoders

We won’t discuss VAEs in great detail. (The machinery is quite a bit involved, and they don’t work as well as GANs.) Autoencoders are fairly simple to understand. These consist of two networks $f_\theta$ and $g_\phi$, concatenated back-to-back trained using the reconstruction loss:

\[L(\theta,\phi) = \frac{1}{n} \sum_{i=1}^n \|x^{i} - f_\theta(g_\phi(x^{i})) \|^2 .\]

The simplest example of an autoencoder is when the functions $f_\theta$ and $g_\phi$ are single layers in a neural network with linear activation (i.e., linear mappings). Then the loss becomes:

\[L(U,V) = \frac{1}{n} \sum_{i=1}^n \|x^{i} - U V^T x^{i} \|^2 .\]

which is equivalent to principal components analysis (PCA). The number of hidden units equals the number of principal components.

The output of $g_\phi$ can be viewed as a compressed representation of the input. This part is called the encoder, and the second part is called the decoder. Once this network is trained we can just take the decoder part and feed in different latent vectors to generate new samples, just like in GANs.

At a high level, variational autoencoders are an example of this approach. The architecture of a VAE looks like this:

where both the encoder and decoder represent probabilistic mappings. The loss function used to train this pair of networks resembles the log-likelihood of the data samples (the same as that used to train normalizing flows, etc.), but is augmented with a Kullback-Leibler divergence regularizer, which encourages the distribution in the latent code ($z$-) space to become Gaussian; so the overall loss looks like this:

\[L_{\text{VAE}}(\theta,\phi) = - E_{z \sim q_\phi} \log p_\theta(x | z) + D_{KL}(q_\phi(z | x) || p_\theta(z))\]
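To give a feel for the regularizer, here is a small sketch of the KL term in one common special case. It assumes (this is not stated in the notes) that the encoder $q_\phi(z | x)$ is a Gaussian with diagonal covariance and that the prior is a standard Gaussian; in that case the divergence has the closed form $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Assumes a diagonal-Gaussian encoder and a standard-normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# The penalty vanishes only when the encoder output matches the prior.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(kl_to_standard_normal([1.0, -2.0], [0.5, -0.5]))
```

The first call returns zero, and any deviation of the mean or variance from the prior makes the term strictly positive, which is how it pulls the latent codes towards a Gaussian.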

which is minimized over both $\theta$ and $\phi$. We will skip the details; refer here for a rigorous treatment.

Lecture 12: Diffusion Models

2021-04-01T00:00:00+00:00

In which we discuss the foundations of generative neural network models.

Unsupervised Learning and Generative Models

Motivation

find underlying laws/characteristics that are salient in a given corpus of data.
given a topic/keyword (say “water lily”), draw/synthesize a new painting (or 250 paintings, all different) based on the keyword.
given a photograph of a face (with the left half blacked out), mentally hallucinate what the rest would look like.
be able to quickly adapt to new tasks.
be able to memorize and recall objects.
be able to plan ahead in the face of uncertain and changing environments.

among many others.

In the next few lectures we will focus on solving such tasks. Somewhat fortunately, the main ingredients of deep learning (feedforward/recurrent architectures, gradient descent/backpropagation, data representations) will remain the same – but we will put them together into novel formulations.

Tasks such as classification/regression are inherently discriminative – the network learns to figure out the answer (or label) for a given input. Tasks such as synthesis are inherently generative – there is no one answer, and instead the network will need to figure out a probability distribution (or, loosely, a set) of possible answers to it. Let us see how to train neural nets that learn to produce such distributions.

[Side note: machine learning/statistics has long dealt with modeling uncertainty and producing distributions. Probabilistic models for machine learning is a vast area in itself (independent of whether we are studying neural nets or not). We won’t have time to go into all the details – take an advanced statistical learning course if you would like to learn more.]

Setup

Let us lay out the problem more precisely. In terms of symbols, instead of learning weights $W$ that learn a discriminative function mapping of the form: $y = f_W(x)$ we will instead imagine that the space of all $x$ is endowed with some probability distribution $p(x)$. This may be a distribution that is without any conditions (e.g., all face images $x$ are assigned high values of $p(x)$, and the set of all images that are not faces are assigned low values of $p(x)$). Or, this may be a conditional distribution $p(x; c)$. (Example: the condition $c$ may denote hair color, and the set of all face images with that particular hair color $c$ will be assigned higher probability versus the rest).

If there were some computationally easy way to represent the distribution $p(x)$, we could do several things:

we could sample from this distribution. This would give us the ability to synthesize new data points.
we could evaluate the likelihood of a given test data point (e.g. answering the question: does this image resemble a face image?)
we could solve optimization problems (e.g. among all potential designs of handbags, find the ones that meet color and cost criteria)
perhaps learn conditional relationships between different features

etc.

The question now becomes: how do we computationally represent the distribution $p(x)$? Modeling distributions (particularly in high-dimensional feature spaces) is not easy – this is called the curse of dimensionality — and the typical approach to resolve this is to parameterize the distribution in some way: $p(x) := p_\Theta(x)$ and try to figure out the optimal parameters $\Theta$ (where we will define what “optimal” means later).

Classical machine learning and statistical approaches start off with simple parameterizations (such as Gaussians). Gaussians are nice in many ways: they are exactly characterized by their mean and (co)variance. We can draw samples easily from Gaussians. By the central limit theorem, the suitably normalized average of many independent, identically distributed draws (with finite variance) is approximately Gaussian. Computationally, we like Gaussians.

Unfortunately, nature is far from being Gaussian! Real-world data is diverse; multi-modal; discontinuous; involves rare events; and so on, none of which Gaussians can handle very well.

Second attempt: Gaussian mixture models. These are better (multi-modal) but still not rich enough to capture real datasets very well.

Enter neural networks. We will start with some simple distribution (say a standard Gaussian) and call it $p(z)$. We will generate random samples from $p$; call it $z$. We will then pass $z$ through a neural network: $x = f_\Theta(z)$ parameterized by $\Theta$. Therefore, the random variable $x$ has a different distribution, say $p(x)$. By adjusting the weights we can (hopefully) deform $p(z)$ to obtain a $p(x)$ that matches any distribution we like. Here, $z$ is called the latent variable (or sometimes the code), and $f$ is called the generative model (or sometimes the decoder).

How are $p(x)$ and $p(z)$ linked? Let us for simplicity assume that $f$ is one-to-one and invertible, i.e., $z = f_\Theta^{-1}(x)$. Then, we can use the Change-of-Variables formula for probability distributions. In one dimension, this is fairly intuitive to understand: in order to conserve mass, the area of the intervals must be the same, i.e., $p(x)\,dx = p(z)\,dz$ and hence the probability distributions must obey:


\[p(x) = p(z) | \frac{dx}{dz} |^{-1}\]
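As a quick sanity check of the one-dimensional formula, here is a minimal sketch. It assumes an illustrative affine map $x = 2z + 3$ (our choice, not from the notes) with $p(z)$ a standard Gaussian. In that case $x$ is Gaussian with mean 3 and standard deviation 2, so we can compare the change-of-variables density against the known answer.

```python
import math

def normal_pdf(v, mean=0.0, std=1.0):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

a, b = 2.0, 3.0   # illustrative map x = f(z) = a * z + b, so dx/dz = a

def density_x(x):
    """p(x) = p(z) * |dx/dz|^{-1}, evaluated at z = f^{-1}(x)."""
    z = (x - b) / a
    return normal_pdf(z) / abs(a)

for x in (1.0, 3.0, 6.5):
    # The two printed densities should agree.
    print(x, density_x(x), normal_pdf(x, mean=b, std=abs(a)))
```

Stretching the axis by a factor of 2 halves the density everywhere, which is exactly the "conservation of mass" argument above.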

When both $x$ and $z$ have more than one dimension, we have to replace areas by (multi-dimensional) volumes and derivatives by partial derivatives. Fortunately, volumes correspond to determinants! Therefore, we can get an analogous formula by replacing the absolute value by the determinant of the Jacobian of the mapping $x = f(z)$:

\[p(x) = p(z) \left| \det \frac{\partial x}{\partial z} \right|^{-1}\]

This gives us a closed-form expression to evaluate any $p(x)$, given the forward mapping. However, note that for this formula to hold, the following conditions must be true:

$f$ must be one-to-one and easily invertible.
$f$ needs to be differentiable, i.e., the Jacobian must be well-defined.
The determinant of the Jacobian must be easy to compute.

Reversible Models

As a warmup, a simple approach that satisfies all of the above conditions is to use reversible models. Recall the residual block that we discussed in the context of CNNs: this is similar. Residual blocks implement: $x = z + F_\Theta(z)$ where $F_\Theta$ is some differentiable network that has equal input and output size. (You can use ReLUs too but strictly speaking we should use differentiable nonlinearities such as sigmoids). Typically, $F_\Theta$ is a dense, shallow (single- or two-layer) network.

Reversible models use the above block as follows. We will consider two auxiliary random variables $u$ and $v$ of the same size as $x$ and $z$, and define two paths: $\begin{aligned} x &= z + F_\Theta(u), \\ v &= u . \end{aligned}$ The variable $u$ is called an additive coupling layer. If you don’t like adding an extra variable for memory reasons (say), you can just split your features into two halves and proceed.

The advantage of this model is that the inverse of this forward model is easy to calculate! Given any $x$ and $v$, the inverse of this model is given by: $\begin{aligned} u &= v, \\ z &= x - F_\Theta(u) . \end{aligned}$
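The forward and inverse maps can be sketched in a few lines. Here `F` is a hypothetical stand-in for the trainable network $F_\Theta$ (an arbitrary fixed function, chosen only for illustration); note that the inverse never needs to invert `F` itself.

```python
import math

def F(u):
    """Stand-in for F_Theta: any function with equal input and output size."""
    return [math.tanh(3.0 * v) + v * v for v in u]

def additive_forward(z, u):
    """x = z + F(u), v = u."""
    x = [zi + fi for zi, fi in zip(z, F(u))]
    return x, list(u)

def additive_inverse(x, v):
    """u = v, z = x - F(u)."""
    u = list(v)
    z = [xi - fi for xi, fi in zip(x, F(u))]
    return z, u

z, u = [0.3, -1.2], [0.7, 0.1]
x, v = additive_forward(z, u)
z_rec, u_rec = additive_inverse(x, v)
print(z_rec, u_rec)   # recovers (z, u) up to floating-point rounding
```

Even though `F` here is not an invertible function, the block as a whole is, because `u` is passed through unchanged and can be reused to subtract `F(u)`.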

What about the determinant of the Jacobian? Turns out that reversible blocks have very simple expressions for the determinant. For each layer, the Jacobian is of the form: $\left( \begin{array}{cc} \frac{\partial x}{\partial z} & \frac{\partial x}{\partial u} \\ \frac{\partial v}{\partial z} & \frac{\partial v}{\partial u} \end{array} \right) = \left( \begin{array}{cc} I & \frac{\partial F_\Theta}{\partial u} \\ 0 & I \end{array} \right)$ which is an upper-triangular matrix with diagonal equal to 1. Such matrices have determinant equal to 1 always (and such transformations are hence called “volume preserving”). In other words, each reversible block maps a set to another set of the same volume.

Having defined a single reversible block, we can now chain multiple such reversible blocks into a deeper architecture by alternating the roles of $x$ and $v$. Let’s say we have a second such block $F_\Psi$ applied to $v$ and $u$. Then, we get the following two-layer architecture: $\begin{aligned} x' &= z + F_\Theta(u) \\ z' &= u + F_\Psi(x') \end{aligned}$

(Exercise: can you compute the inverse of this two-layer block?)

Turns out that each such block is volume preserving, and hence the determinant of the overall Jacobian (no matter how many blocks we stack) is equal to unity. We can think of each layer as incrementally changing the distribution until we arrive at the final result. Such a model that implements this type of incremental change is called a “flow” model. (The specific form above was called NICE – short for Nonlinear Independent Components Estimation).

We finally come to training this model. Different objective functions can be used: a common one is maximum likelihood: given a dataset of $n$ samples $x_1, x_2, \ldots, x_n$ we optimize for the parameters that maximize the overall likelihood: $L(\Theta) = \prod_{i=1}^n p_X(x_i) = \prod_{i=1}^n p_Z(f^{-1}(x_i))$ where $p_Z$ is the base distribution. (Note that the Jacobian disappears.) In practice, sums are easier to optimize than products, and therefore we use the log-likelihood instead.

Normalizing Flows

Reversible blocks are nice from a compute standpoint, but have architectural limitations due to the volume preserving constraint.

Normalizing Flows (NF) generalize the above technique, and allow the mapping to be non-volume preserving (NVP). The idea is to assume an arbitrary series of maps: $f_1, f_2, \ldots, f_L$ (where $L$ is the depth), so that: $x = f_L \circ \ldots \circ f_2 \circ f_1(z) .$ Define $z_0 := z$ and $z_i$ as the output of the $i$-th layer. Applying the change-of-variables formula to any intermediate layer, we have the distributional relationship: $\log p(z_i) = \log p(z_{i-1}) - \log | \text{det} \frac{\partial z_i}{\partial {z_{i-1}}} |.$ and recursing over $i$, we have the log likelihood: $\log p(x) = \log p(z) - \sum_{i=1}^L \log | \text{det} \frac{\partial z_i}{\partial z_{i-1}} | .$ This is a bit more complicated to evaluate, but in principle it can be done.
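The recursion is easiest to see in one dimension. The sketch below assumes each layer is an affine map $z_i = a_i z_{i-1} + b_i$ with made-up parameters (real flows use richer layers); the log-determinant of each layer is then just $\log |a_i|$, and the result can be checked against the exact Gaussian density of $x$.

```python
import math

def log_normal_pdf(v, mean=0.0, std=1.0):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

# Each layer is z_i = a_i * z_{i-1} + b_i (illustrative parameters).
layers = [(2.0, 1.0), (-0.5, 0.3), (3.0, -2.0)]

def log_px(x):
    """log p(x) = log p(z_0) - sum_i log|dz_i/dz_{i-1}|, inverting layers last to first."""
    z, log_det_sum = x, 0.0
    for a, b in reversed(layers):
        z = (z - b) / a
        log_det_sum += math.log(abs(a))
    return log_normal_pdf(z) - log_det_sum

# Reference answer: a composition of affine maps is affine, x = A * z + B,
# so x is Gaussian with mean B and standard deviation |A|.
A, B = 1.0, 0.0
for a, b in layers:
    A, B = a * A, a * B + b

print(log_px(0.4), log_normal_pdf(0.4, mean=B, std=abs(A)))   # should agree
```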

To make life simpler, in NF, we use the same principles as we did for reversible architectures:

easy inverses for each layer
easy Jacobian determinant

but this time, instead of creating an additive coupling layer $u$, we use an affine coupling layer: $\begin{aligned} x &= z \odot \exp(F_\Theta(u)) + F_\Psi(u), \\ v &= u . \end{aligned}$ where $F_\Theta$ and $F_\Psi$ are trainable functions, and $\odot$ is applied component wise. The inverse of the affine coupling layer is simple: $\begin{aligned} u &= v, \\ z &= (x - F_\Psi(u)) \odot \exp(-F_\Theta(u)). \end{aligned}$ Moreover, the Jacobian has the following structure: $J = \left( \begin{array}{cc} \frac{\partial x}{\partial z} & \frac{\partial x}{\partial u} \\ \frac{\partial v}{\partial z} & \frac{\partial v}{\partial u} \end{array} \right) = \left( \begin{array}{cc} \text{diag}(\exp(F_\Theta(u))) & \frac{\partial x}{\partial u} \\ 0 & I \end{array} \right)$ which is an upper-triangular matrix (the off-diagonal block does not affect the determinant), but with an easy-to-calculate determinant: $\det(J) = \exp(\sum_{i=1}^d F^{i}_\Theta(u)).$
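Here is a minimal sketch of an affine coupling layer together with its inverse and log-determinant. The functions `scale_net` and `shift_net` are hypothetical stand-ins for the trainable networks $F_\Theta$ and $F_\Psi$ (fixed functions chosen only for illustration).

```python
import math

def scale_net(u):
    """Stand-in for F_Theta (log-scales)."""
    return [0.5 * math.sin(v) for v in u]

def shift_net(u):
    """Stand-in for F_Psi (shifts)."""
    return [v * v - 1.0 for v in u]

def affine_forward(z, u):
    """x = z * exp(F_Theta(u)) + F_Psi(u), v = u; also returns log|det J|."""
    s, t = scale_net(u), shift_net(u)
    x = [zi * math.exp(si) + ti for zi, si, ti in zip(z, s, t)]
    log_det = sum(s)   # log det J = sum_i F_Theta^i(u)
    return x, list(u), log_det

def affine_inverse(x, v):
    """u = v, z = (x - F_Psi(u)) * exp(-F_Theta(u))."""
    u = list(v)
    s, t = scale_net(u), shift_net(u)
    z = [(xi - ti) * math.exp(-si) for xi, si, ti in zip(x, s, t)]
    return z, u

z, u = [0.3, -1.2, 2.0], [0.7, 0.1, -0.4]
x, v, log_det = affine_forward(z, u)
z_rec, u_rec = affine_inverse(x, v)
print(z_rec, log_det)   # z is recovered; log_det is generally nonzero
```

Unlike the additive case, the log-determinant is no longer zero, so it has to be carried along when evaluating the log likelihood.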

Autoregressive models, diffusion models, etc

The above generative architectures are feedforward, dense, and useful for static structured data (such as images). For sequential data (such as music), we can develop similar models using RNN-type architectures.

The differences between the methods lie in the details, but the basic idea is that the features in the output $x$ are not simultaneously generated (as in a feedforward network), but rather, generated one after the other. Moreover, since certain types of sequence data (such as voice or music) usually respect causality, the architectures are restricted to be auto-regressive, i.e., the probability distribution of every generated sample $x$ is decomposed as: $P(x) = \prod_{i=1}^d p(x[i] | x[0:i])$ (where we are abusing Python notation here). Various typical assumptions (e.g. Markov-ness) are made to simplify this, just as how we did for an RNN. But fundamentally, since there are $d$ terms in the above product one would have to “unroll” the operation into a depth-$d$ network, which can be rather challenging.
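As a toy illustration of the autoregressive factorization, the sketch below uses a first-order Markov assumption over two symbols. The probability tables are made up, and a real model would replace them with a learned network; the point is only that the likelihood is a product of conditionals and that sampling proceeds one symbol at a time.

```python
import math
import random

# Illustrative first-order Markov model over two symbols (made-up numbers).
p_first = {"a": 0.6, "b": 0.4}
p_next = {"a": {"a": 0.9, "b": 0.1},
          "b": {"a": 0.5, "b": 0.5}}

def log_prob(x):
    """log P(x) = log p(x[0]) + sum_i log p(x[i] | x[i-1])."""
    total = math.log(p_first[x[0]])
    for prev, cur in zip(x, x[1:]):
        total += math.log(p_next[prev][cur])
    return total

def sample(d, rng):
    """Generate d symbols one after the other, each conditioned on the previous one."""
    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]
    x = [draw(p_first)]
    while len(x) < d:
        x.append(draw(p_next[x[-1]]))
    return x

print(math.exp(log_prob("aab")))      # 0.6 * 0.9 * 0.1
print(sample(8, random.Random(0)))

# The conditionals define a valid distribution: probabilities sum to one.
total = sum(math.exp(log_prob(a + b + c)) for a in "ab" for b in "ab" for c in "ab")
```

The sampling loop is inherently sequential, which is the "unrolling" cost mentioned above.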

WaveNet, used for audio signal generation, reduces the depth in a smart way: it uses a technique called dilated convolution that effectively reduces the depth to $\log d$ by grouping together symbols and effectively using parallelism. We won’t get into any further detail here.
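To see where the $\log d$ comes from, here is a small back-of-the-envelope sketch. It assumes a stack of causal convolutions with kernel size 2 whose dilation doubles at every layer (1, 2, 4, ...); these specifics are our assumption for illustration. Each layer extends the receptive field by (kernel size − 1) × dilation.

```python
def receptive_field(num_layers, kernel_size=2, doubling=True):
    """Number of input positions seen by one output after num_layers layers."""
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer if doubling else 1
        field += (kernel_size - 1) * dilation
    return field

def layers_needed(d, doubling=True):
    """Smallest depth whose receptive field covers d input positions."""
    n = 0
    while receptive_field(n, doubling=doubling) < d:
        n += 1
    return n

print(layers_needed(1024))                   # 10 layers with doubling dilations
print(layers_needed(1024, doubling=False))   # 1023 layers without dilation
```

With doubling dilations the receptive field grows as $2^L$, so covering $d$ symbols takes about $\log_2 d$ layers instead of roughly $d$.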

Under construction

Lecture 12: Diffusion Models

2021-04-01T00:00:00+00:00

In which we discuss the foundations of generative neural network models.

Co-authored with Teal Witter.

Diffusion models

Motivation

We had previously discussed generative adversarial networks (GANs). GANs create realistic images by playing a game between two neural networks called the generator and discriminator; the generator learns to create realistic images while the discriminator learns to differentiate between these fake images and the real ones. Unfortunately, GANs suffer from a problem called mode collapse. During training, the generator can memorize a single image from the real data set and the discriminator (correctly) determines that the image is real. The problem is that training stops because the generator has achieved minimum loss. Now, we’re stuck with a generator that only outputs one image (to be fair, the one image does look real).

There have been several attempts to mitigate mode collapse by trying different loss functions (a loss function based on Wasserstein distance is particularly effective) and adding a regularization term (the idea is to force the generator to use the random noise it’s given as input). However, these and similar approaches cannot completely prevent mode collapse.

So we’re left with the same problem we had before: how to generate realistic images. Instead of GANs, the deep learning community has recently turned to diffusion …and the results are astounding. The basic idea of diffusion is to repeatedly apply a de-noising process to a random state until it resembles a realistic image. Let’s dive into the details.

Diffusion Process

Our goal is to turn random noise into a realistic image. The starting point of diffusion is the simple observation that, while it’s not obvious how to turn noise into a realistic image, we can turn a realistic image into noise. In particular, starting from a real image in our data set we can repeatedly apply noise (typically drawn from a normal distribution) until the image becomes completely unrecognizable. Suppose $x_0$ is the real image drawn from our data set. Let the first noised image be $x_1 = x_0 + \epsilon_1$ where the noise $\epsilon_1$ is drawn from a normal distribution $\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ for some variance $\sigma^2$. In this way, we can generate $x_{t} = x_{t-1} + \epsilon_{t}$ where $\epsilon_{t}$ is again drawn from the same distribution. We end up with a sequence $x_0, x_1, \ldots, x_T$ where the total number of steps $T$ is chosen so that $x_T$ looks entirely meaningless.
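The noising process is simple enough to sketch directly. The snippet below uses a flat list of zeros as a toy "image" (an assumption, purely for illustration) so that the accumulated noise is easy to read off: after $t$ steps the mean squared value should be roughly $t \sigma^2$.

```python
import random

def noising_sequence(x0, num_steps, sigma, rng):
    """Return [x_0, x_1, ..., x_T] with x_t = x_{t-1} + eps_t, eps_t ~ N(0, sigma^2 I)."""
    xs = [list(x0)]
    for _ in range(num_steps):
        eps = [rng.gauss(0.0, sigma) for _ in x0]
        xs.append([v + e for v, e in zip(xs[-1], eps)])
    return xs

def mean_square(v):
    """Average squared entry of a vector."""
    return sum(a * a for a in v) / len(v)

rng = random.Random(0)
x0 = [0.0] * 2000   # toy "image" with 2000 pixels
xs = noising_sequence(x0, num_steps=5, sigma=1.0, rng=rng)

for t, x_t in enumerate(xs):
    print(t, round(mean_square(x_t), 2))   # grows roughly like t * sigma^2
```

Each consecutive pair $(x_{t-1}, x_t)$ in `xs` is one training example for the de-noising direction.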

In this example, we turned a picture of Stripes on a bike into complete gibberish by adding normal noise five times. The key insight is that if we look at this sequence backwards then we have training data which we can use to teach a model how to remove noise.

Formally, the training data consists of $x_t$, $t$, and $x_{t-1}$. Our goal is to train a model $f_\theta$ to predict $x_{t-1}$ from $x_t$ and $t$. A very natural choice of loss function is then

$\mathcal{L}(\theta) = \mathbb{E} [\| x_{t-1} - f_\theta(x_t, t) \|^2]$ where the expectation is over $x_t$, $t$, and $x_{t-1}$. However, researchers have found that it’s actually better for $f_\theta$ to predict the noise $\epsilon_{t}$ and then subtract it from $x_t$ to get $x_{t-1}$. Formally, the loss function is

$\mathcal{L}(\theta) = \mathbb{E} [\| \epsilon_{t} - f_\theta(x_t, t) \|^2]$ where the expectation is again over $x_t$, $t$, and $x_{t-1}$ which induces $\epsilon_t = x_t - x_{t-1}$.

Once we have a working $f_\theta$, we can use it to generate realistic images. We start with random noise which we’ll call $x_T'$. Then for $t=T,\ldots, 1$, we predict $\epsilon_t' = f_\theta(x_t', t)$ and compute $x_{t-1}' = x_t' - \epsilon_t'$. The final result $x_0'$ is what the diffusion process outputs. With any luck, this output is a realistic image.
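The generation loop can be sketched as follows. Since we do not have a trained network here, the sketch plugs in a hypothetical "oracle" predictor that simply remembers the noise added in the forward process; this is only a sanity check of the loop, not a stand-in for a real $f_\theta$.

```python
import random

def sample(f_theta, x_T, T):
    """Run x_{t-1} = x_t - f_theta(x_t, t) for t = T, ..., 1 and return the final x_0."""
    x = list(x_T)
    for t in range(T, 0, -1):
        eps_hat = f_theta(x, t)
        x = [v - e for v, e in zip(x, eps_hat)]
    return x

# Forward process on a tiny example, recording the noise at every step.
rng = random.Random(0)
T, sigma = 5, 1.0
x0 = [0.2, -0.7, 1.5]
noise = {}   # noise[t] holds eps_t
x = list(x0)
for t in range(1, T + 1):
    noise[t] = [rng.gauss(0.0, sigma) for _ in x0]
    x = [v + e for v, e in zip(x, noise[t])]
x_T = x

def oracle(x_t, t):
    """A perfect noise predictor (assumption for illustration only)."""
    return noise[t]

x0_rec = sample(oracle, x_T, T)
print(x0_rec)   # matches x0 up to floating-point rounding
```

With a perfect predictor the loop undoes the noising exactly; a trained network only approximates this, which is why the output is a plausible image rather than a copy of a training image.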

Thinking back to our three-step recipe for machine learning, we have the loss function and optimizer (SGD, as usual) but how do we choose a good architecture?

Autoencoders and U-Nets

We’ll start with a high level description of autoencoders, work our way to u-nets, and then tie it all back to diffusion. I like to think about autoencoders in relation to GANs. Recall that GANs go small-big-small: they turn a small noise vector into an image (using the generator) and then convert the image into a scalar representing its realness (using the discriminator). In contrast, autoencoders go big-small-big: they compress a real image into a small embedding (using the encoder) and then reconstruct the original image from the embedding (using the decoder).

We can also think of the architecture of autoencoders in relation to GANs. Just like the discriminator, the encoder uses convolutional layers to go “small” while, just like the generator, the decoder uses transposed convolutional layers to go “big”. The loss function typically used for autoencoders is the ($\ell_2$-norm) difference between the real image and the reconstructed image.

The real benefit of autoencoder architectures is that we get a meaningful representation of an image that somehow captures its “inherent” properties. In our current setting, one might think we can use this inherent meaning to differentiate the true content of the image from noise. And that’s exactly the motivation for the architecture we’ll use for the diffusion model.

In particular, we’ll use what’s called a u-net. The u-net consists of convolutions and transposed convolutions tied together with pooling and residual connections. The model gets its distinctive name from the shape of its architecture (see below).

[[Source]](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/)

Diffusion in Latent Space

At this point, we have the loss function, architecture, and optimizer for diffusion. What more could we need? Well, one nagging issue is that we’re adding and predicting noise in the high-dimensional pixel space (the space gets even larger if we want higher resolution!). This presents a computational problem since we’ll need lots of parameters and compute for our u-net. One novel contribution of stable diffusion is to apply the autoencoder idea again in a different way.

Looking at the first few noised versions of Stripes in our example, we could probably differentiate the noise (or at least identify his vague outline). But, to a computer, the visual properties of the pixel space that we’re so sensitive to are useless. We might as well embed the images into a latent space which captures more meaning. Then, within the latent space, we build a model to convert noise to an embedding of a realistic image. Once we have the final output of the diffusion process, we decode into an image that we can understand. This is exactly what stable diffusion does and, as a result, it gains the efficiency of working in a smaller, more meaningful space.

Text Conditioning

The really cool part of stable diffusion is that it generates an image of any text prompt we give it. But we’ve only talked about diffusion as an unconditional process for turning noise into realistic images. What we want is a way of guiding the u-net through the denoising process so that it generates an image close to the text prompt we give it. We accomplish this by embedding the training images and their text descriptions in the same latent space. Using contrastive language image pretraining (CLIP), we ensure that the embedding of an image is close to the embedding of its description.

Now, we train the u-net on the embedded noised image $x_t$, the number of noise steps $t$, and the embedded text description $w$ of the original embedded image $x_0$. Formally, our goal is for $f_\theta(x_t, t, w) \approx \epsilon_t$. But how do we feed the u-net the embedded text in a meaningful way? Stable diffusion uses an architecture with cross-attention heads after the residual connection. The cross-attention is between the embedded text and the u-net’s representation of the embedded image. Intuitively, cross-attention gives the u-net a way of conditioning the de-noising process on the text description. This is very helpful: if we were told a noisy image depicts a cat on a bike, we would see it differently than if we were told it depicts a flying turtle.

Once we have a working $f_\theta$, we can guide it through the denoising process with an embedded text prompt $w$. We start with noise in the latent space $x_T'$ and for $t=T, \ldots, 1$, we predict $\epsilon_t' = f_\theta(x_t', t, w)$ and compute $x_{t-1}' = x_t' - \epsilon_t'$. Stable diffusion then decodes $x_0'$ into pixel space and, with any luck, the result is an image of the text prompt we started with.

Lecture 10: Reinforcement Learning (II)

2021-03-30T00:00:00+00:00

In which we continue laying out the basics of reinforcement learning.

Recall that in the previous lecture we talked about a new mode of ML called reinforcement learning (RL), where the observations occur in a dynamic environment, and the learning module (also called the agent) needs to figure out the best sequence of actions to be taken (also called the policy) in order to maximize a given objective (also called the reward).

We also discussed a method called Policy Gradients, which uses the log-derivative trick to rewrite the problem in such a way that we can use standard ML tools (such as SGD) to learn a good RL policy. This led to an algorithm called REINFORCE (or Monte Carlo Policy Search), which can be viewed as an instantiation of random search used in derivative-free optimization.

(Aside: notice that nowhere in the above discussion did deep learning show up – indeed, RL can be used in very general settings. In the context of policy gradients, deep learning arises only if we choose to parameterize the policy in terms of a deep neural network.)

Today, we will learn about a different family of RL approaches which does something slightly different.

Q-Learning

Recall the setup in policy gradients:

The agent receives a sequence of observations (in the form of e.g. image pixels) about the environment.
The state at time $t$, $s_t$, is the instantaneous relevant information of the agent.
The agent can choose an action, $a_t$, at each time step $t$ (e.g. go left, go right, go straight). The next state of the game is determined by the current state and the current action:

\[s_{t+1} \sim f(s_t, a_t) .\]

Here, $f$ is the state transition function that is entirely determined by the environment. We use the symbol $\sim$ to denote the fact that environments could be random and an action may sometimes have unpredictable consequences.

The agent periodically receives rewards/penalties as a function of the current state and action, $r_t = r(s_t,a_t)$.
The sequence of state-action pairs $\tau_t = (s_0, a_0, s_1, a_1, \ldots, a_t, s_t)$ is called a trajectory or rollout. The rollout is usually defined over a fixed time horizon $L$. In policy gradients, our goal is to minimize the (negative) reward:

\[\begin{aligned} \text{minimize}~&R(\tau) = \sum_{t=0}^{L-1} - r(s_t, a_t), \\ \text{subject to}~&s_{t+1} \sim f(s_t,a_t) \\ & a_t \sim \pi(\tau_t) . \end{aligned}\]

Let us now think of the problem in a slightly different fashion, which is somewhat more applicable in the context of goal-oriented RL. Instead of choosing good actions to take at each time step, an alternative is to identify (a sequence of) good states to visit. For simplicity, it is convenient to assume discrete spaces for both states and actions. It is also convenient to think in terms of episodes instead of rollouts. So each episode could be viewed as one run of a game.

This makes sense in the context of games: the ultimate goal is to reach the “win” state, just as how the ultimate goal in chess is to have the board result in a “checkmate” of the opponent. A common simple example given in the RL literature is the game of Frozen Lake (taken from OpenAI Gym), where the objective is to skate along the surface of a (frozen) lake, modeled as a 4x4 grid, from a starting position to the goal without falling into any “holes” in the lake. (The ice is slippery, so there is some randomness in the environment.)

This is a rather simple game (there are 16 states, and 4 actions per state). But one could model more complex RL problems too in this manner. In autonomous navigation, for example, the ultimate state is achieved when the agent has reached the destination, and other states along the path leading to this final “win” state are likely to be also good states.

(In fact, this idea of looking backwards from the “win” state, and identifying which states lead to wins, is exactly the same principle that we use in dynamic programming (DP). As we will see soon, what we will discuss below can be viewed as an approximate version of DP.)

The way we characterize “good states” is by a quantity called the value function. To understand this, we first need to define the return, which is the sum of all anticipated rewards in the future over an infinite time horizon. In practice, we cannot sum over infinitely many rewards so we discount future rewards by a decay factor $\gamma$, leading to the discounted return:

\[G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots\]
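For a finite list of rewards, the discounted return can be computed with a single backward pass, using the recursion $G_t = r_t + \gamma G_{t+1}$. The sketch below assumes the episode ends after the last listed reward (so all later rewards are zero); the reward values are made up.

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # a reward two steps away is worth 0.9^2
```

The second call shows the effect of the decay factor: the same reward is worth less the further into the future it arrives.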

“Good states” are likely to provide good returns, provided a sensible policy is chosen. The value function of a state $s$ under a given policy $\pi$ is defined as the expected discounted return if we start at $s$ and obey $\pi$:

\[V^{\pi}(s) = \mathbb{E} [G_t | s_t = s] = \mathbb{E} [ \sum_{i=0}^\infty \gamma^i r_{t+i} | s_t = s] .\]

The value function gives us a way to identify good states versus not-so-good ones, but it does not quite tell us how to reach these states. In order to do so, we need to go one more step: define an action-value function, or a Q-function, which is defined as the expected discounted return if we start at $s$, take action $a$, and subsequently follow the policy:

\[Q^{\pi}(s,a) = \mathbb{E} [G_t | s_t = s, a_t = a] .\]

Since we have assumed (for convenience) that both the state and action spaces are discrete, we can think of the Q-function as a giant table (similar to the table that we encounter in DP). Also, by the law of iterated expectation, we can link the Q-function and the value function by just averaging over all possible actions, weighted by the likelihood of choosing action $a$ under the policy:

\[V^{\pi}(s) = \sum_a \pi(a | s) Q^{\pi}(s,a) .\]

The Q-function gives us a way to determine the optimal policy as follows. If the Q-function were available (somehow, and we will discuss how to learn it), we could just choose optimal actions by picking the one that maximizes the expected return:

\[\pi^*(s) = \arg \max_a Q(s,a)\]
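To make the table picture concrete, here is a small sketch with a made-up Q-table and a made-up stochastic policy (both are illustrative assumptions, not values from any real game). It computes the value of a state by averaging Q over actions, and reads off the greedy action.

```python
# Hypothetical Q-table for 2 states and 3 actions: Q[s][a].
Q = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0]]

# Hypothetical stochastic policy: pi[s][a] = probability of action a in state s.
pi = [[0.2, 0.5, 0.3],
      [1 / 3, 1 / 3, 1 / 3]]


def value_from_q(Q, pi, s):
    """V(s) = sum_a pi(a|s) Q(s,a)."""
    return sum(p * q for p, q in zip(pi[s], Q[s]))


def greedy_action(Q, s):
    """argmax_a Q(s,a)."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


print(value_from_q(Q, pi, 0))                     # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
print(greedy_action(Q, 0), greedy_action(Q, 1))   # actions 1 and 2
```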

All this sounds good, but how do we actually discover the Q-function? And where does learning enter the picture?

Algorithms for Q-learning

The key to Q-learning is a recursive characterization of the optimal Q-function called the Bellman Equation, similar to how DP tables are recursively constructed. There is a formal derivation in the probabilistic case, which we won’t derive here. But intuitively the Bellman equation states that if the policy is optimally chosen, then the $Q$ function at the current time step is the current reward, plus the best return achievable at the next time step.

\[Q^*(s_t,a_t) = r(s_t,a_t) + \gamma \max_{a'} Q^* (s_{t+1},a')\]

The Bellman equation also gives us a way to perform learning in the RL setting. We start with an estimate of the $Q$-function (say, an empty table, or a table with random values). We start at some state $s$, take an action, collect a reward $r$, and then move to the next state $s'$ (in short, the quadruple $(s,a,r,s')$). The Bellman error is defined as the mean-squared error between the current estimate $Q$ and the predicted estimate:

\[l = \frac{1}{2} \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)^2\]

which is a quantity that we (as ML engineers) love to see, since we can immediately use this error term to perform gradient descent:

\[Q(s,a) \leftarrow Q(s,a) + \eta \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]\]

and that’s it! The above procedure can be repeated by sampling different states and actions, observing the rewards, and updating the Q-function as we go along.

There is a small catch here though: note that this limits $Q$-learning to visited states and actions; but next actions are picked according to the table itself, which are plausibly optimal. This means that certain state-action pairs are never visited. Sometimes the agent needs to pick sub-optimal actions in order to visit new states; this is a common issue in RL called the exploration-exploitation tradeoff.

The easy fix is to choose an $\epsilon$-greedy policy: with probability $\epsilon$, we choose a random action, and with probability $1-\epsilon$, we choose the optimal action according to $Q$. So the overall algorithm becomes the following.

Initialize $Q$, repeat (for each episode):

Initialize $s$
Repeat for each step of episode:
- Choose an action $a$ using $\epsilon$-greedy policy
- Take action $a$, observe reward $r$ and state $s'$
- $Q(s,a) \leftarrow Q(s,a) + \eta \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
- $s \leftarrow s'$
Until $s$ is the end-state.

The above algorithm can be implemented with any game engine/simulator.
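Here is a minimal sketch of the algorithm above. To keep it self-contained we do not use a real game engine; instead we assume a toy deterministic "corridor" environment (our own invention): states $0,\ldots,4$, the agent starts at 0, the episode ends at state 4, and the only reward is $+1$ on reaching the end. All names and hyperparameters are illustrative. Ties in the greedy choice are broken at random, so that the agent does not get stuck while the table is still all zeros.

```python
import random

N_STATES, GOAL = 5, 4   # corridor 0..4; the episode ends at state 4
ACTIONS = (0, 1)        # 0 = step left, 1 = step right


def step(s, a):
    """Toy deterministic environment: returns (reward, next_state)."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return (1.0 if s_next == GOAL else 0.0), s_next


def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)                      # explore
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])  # exploit


def q_learning(episodes=500, eta=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]           # Q[s][a]; end-state row stays 0
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, eps, rng)
            r, s_next = step(s, a)
            # Q(s,a) <- Q(s,a) + eta * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s][a] += eta * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q


Q = q_learning()
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)  # greedy action in states 0..3; "always step right" is optimal here
```

In this toy problem the optimal values are easy to work out by hand: $Q^*(3, \text{right}) = 1$, and each step further from the goal multiplies the value by $\gamma$, so $Q^*(0, \text{right}) = \gamma^3$.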

Deep Q-learning

So far, we have imagined both actions and states to be discrete spaces, and hence $Q$ is a table.

There are two issues here:

Impractical (too many states in many cases, even infinite if we are talking about continuous environments)
No structure/shared information between states and actions

Similar to policy gradients, one can resolve this by parameterizing the $Q$-function. This can be done in a few different ways. For example, we could do a linear function approximation:

\[Q(s,a) = w^T \psi(s,a)\]

where $\psi$ is some feature embedding of the tuple $(s,a)$. (As to where the embedding comes from: this is identical to the challenge of “word embeddings” for NLP, and similar techniques can be used here, which we won’t discuss.)

Or, alternatively, we could think of $Q$ to be some deep neural network, parameterized by weights $w$. The latter would be called deep Q-learning. One can prove that the Bellman equation remains the same, so the only thing that changes is the gradient descent equation:

\[\begin{aligned} t &\leftarrow r + \gamma \max_{a'} Q(s',a') \\ w &\leftarrow w + \eta \, (t - Q(s,a)) \frac{\partial Q(s,a)}{\partial w} \end{aligned}\]
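To see what this update looks like in code, here is a sketch for the simpler linear case $Q(s,a) = w^T \psi(s,a)$, where $\partial Q / \partial w$ is just the feature vector. The feature vectors below are made up for illustration, and the step size `eta` is an assumed hyperparameter. For a deep network the only change is that the gradient is computed by backprop.

```python
def q_value(w, psi):
    """Linear Q-function: Q(s,a) = w^T psi(s,a)."""
    return sum(wi * xi for wi, xi in zip(w, psi))


def q_update(w, psi_sa, r, next_psis, eta=0.1, gamma=0.9):
    """One update step; next_psis holds psi(s', a') for every action a'."""
    # The target t is treated as a constant (we do not differentiate through it).
    target = r + gamma * max(q_value(w, p) for p in next_psis)
    td_error = target - q_value(w, psi_sa)
    # For a linear model, dQ(s,a)/dw = psi(s,a).
    return [wi + eta * td_error * xi for wi, xi in zip(w, psi_sa)]


w = [0.0, 0.0, 0.0]
psi_sa = [1.0, 0.0, 1.0]                          # hypothetical features of (s, a)
next_psis = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # hypothetical features of (s', a')
w = q_update(w, psi_sa, r=1.0, next_psis=next_psis)
print(w)  # target = 1, error = 1, so w moves by eta * psi(s,a)
```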

A bit of history: among the several breakthroughs in deep learning that happened in the early 2010s was the success of neural nets in cracking 80s-style Atari games in 2013. A rather shallow network with 3 hidden layers was used; see Figure 2.

Comparisons with policy gradients

In contrast with policy gradients (which directly learn policies), Q-learning introduces an intermediate quantity (the Q-function) that explicitly assigns value to states and actions.

Pros of policy gradients:

There is no Q-function table to be populated, so one can handle large, or even continuous, action spaces.
No need to model intermediate variables (such as Value/Q-function); the model directly estimates the policy.
Unlike Q-learning (where the final optimal policy is deterministic: it is the max over all actions for a given state), policy gradients can output stochastic/non-deterministic policies. This is useful in games without stable equilibria (such as Rock-Paper-Scissors) where there is no single deterministic policy that is the best.

Pros of DQN:

Q-learning is (generally) more sample-efficient (recall policy gradients are similar to random search). Therefore, with a fixed number of episodes/training data, Q-learning tends to perform better.
Q-learning gives an estimate of anticipated return at each time step, which can be useful in higher-level planning, reasoning, and control tasks.

Lecture 9: Reinforcement Learning (I)

2021-03-29T00:00:00+00:00

In which we introduce the basics of reinforcement learning.

Throughout this course, we have primarily focused on supervised learning (building a prediction function from labeled data), and briefly also discussed unsupervised learning (generative models and word embeddings). In both cases, we have assumed that the data to the machine learning algorithm is static and the learning is performed offline.

Neither assumption is true in the real world! The data that is available is often influenced by previous predictions that you have made. (Think, for example, of stock markets.) Moreover, data is continuously streaming in, so one needs to be able to adapt to uncertainties and unexpected pitfalls in a potentially adverse environment.

Applications that fall into this category include:

AI for games (both computer/video games as well as IRL games such as Chess or Go)
teaching robots how to autonomously move in their environment
self-driving cars
algorithmic trading in markets

among others.

This set of applications motivates a third mode of ML called reinforcement learning (RL). The field of RL is broad and we will only be able to scratch the surface. But several of the recent success stories in deep learning are rooted in advances in RL – the most high-profile of these are DeepMind’s AlphaGo and OpenAI’s DOTA 2 AI, which were able to beat the world’s best human players in Go and DOTA 2 respectively. These AI agents were able to learn winning strategies entirely automatically (albeit by leveraging massive amounts of training data; we will discuss this later).

To understand the power of RL, consider – for a moment – how natural intelligence works. An infant presumably learns by continuously interacting with the world, trying out different actions in possibly chaotic environments, and observing outcomes. In this mode of learning, the inputs to the learning module in the infant’s brain are decidedly dynamic; learning has to be done online; and very often, the environment is unknown beforehand.

For all these reasons, the traditional mode of un/supervised learning does not quite apply, and new ideas are needed.

A quick aside: the above questions are not new, and the formal study of these problems is actually classical. The field of control theory is all about solving optimization problems of the above form. But the approaches (and applications) that control theorists study are rather different compared to those that are now popular in machine learning.

Setup

We will see that RL is actually “in-between” supervised and unsupervised learning.

The basis of RL is an environment (modeled by a dynamical system), and a learning module (called an agent) that takes actions at each time step over a period of time. Actions have consequences: actions periodically lead to reward, or penalty (equivalently, negative reward). The goal is for the agent to learn the best policy that maximizes the cumulative reward. All fairly intuitive!

Here, the “best policy” is application-specific – it could refer to the best way to win a game of Space Invaders, or the best way to allocate investments across a portfolio of stocks, or the best way to navigate an autonomous vehicle, or the best way to set up a cooling schedule for an Amazon Datacenter.

All this is a bit abstract, so let us put this into concrete mathematical symbols, and interpret them (as an example) in the context of the classic iOS game Temple Run, where your game character is either Guy Dangerous or Scarlett Fox and your goal is to steal a golden idol from an Aztec temple while being chased by demons. (Fun game. See Figure 1.) Here,

The environment is the 3D game world, filled with obstacles, coins, etc.
The agent is the player.
The agent receives a sequence of observations (in the form of e.g. image pixels) about the environment.
The state at time $t$, $s_t$, is the instantaneous relevant information of the agent (e.g. the 2D position and velocity of the player).
The agent can choose an action, $a_t$, at each time step $t$ (e.g. go left, go right, go straight). The next state of the game is determined by the current state and the current action:

\[s_{t+1} = f(s_t, a_t) .\]

Here, $f$ is the state transition function that is entirely determined by the environment. In control theory, we typically call this a dynamical system.

The agent periodically receives rewards (coins/speed boosts) or penalties (speed bumps, or even death!). Rewards are also modeled as a function of the current state and action, $r(s_t,a_t)$.
The agent’s goal is to decide on a strategy (or policy) of choosing the next action based on all past states and actions:

\[s_t, a_{t-1}, s_{t-1}, \ldots, s_1, a_1, s_0, a_0.\]

The sequence of state-action pairs $\tau_t = (s_0, a_0, s_1, a_1, \ldots, a_t, s_t)$ is called a trajectory or rollout. Typically, it is impractical to store and process the entire history, so policies are chosen only over a fixed time interval in the past (called the horizon length $L$).

So a policy is simply defined as any function $\pi$ that maps $\tau$ to $a_t$. Our goal is to figure out the best policy (where “best” is defined in terms of maximizing the rewards).

But as machine learning engineers, we can fearlessly handle minimization/maximization problems! Let us try and apply the ML tools we know here. Pose the cumulative negative reward as a loss function, and minimize this loss as follows:

\[\begin{aligned} \text{minimize}~&R(\tau) = \sum_{t=0}^{L-1} - r(s_t, a_t), \\ \text{subject to}~&s_{t+1} = f(s_t,a_t) \\ & a_t = \pi(\tau_t) . \end{aligned}\]

The cumulative reward function $R(\tau)$ is sometimes replaced by the discounted cumulative reward, in case we exponentially decay the reward across time with some factor $0 < \gamma < 1$:

\[R_{\text{discounted}}(\tau) = \sum_{t=0}^{L-1} - \gamma^t r(s_t, a_t) .\]

OK, this looks similar to a loss minimization setting that we are all familiar with. We can begin to apply any of our optimization tools (e.g. SGD) to solve it. Several caveats emerge, however, and we have to be more precise about what we are doing.

First, what are the optimization variables? We are seeking the best among all policies $\pi$ (which, above, are defined as functions from trajectories to actions), so this means that we will have to parameterize these policies somehow. We could imagine $\pi$ to be a linear model that maps trajectories to actions, or kernel model, or a deep neural network. It really does not matter conceptually (although it does matter a lot in practice).

Second, what are the “training samples” provided to us and what are we trying to learn? The key assumption in RL is that everything in the general case is probabilistic:

the policy is stochastic. So what $\pi$ is actually predicting from a given trajectory is not a single best action but a distribution over actions. More favorable actions get assigned higher probability and vice versa.
the environment’s dynamics, captured by $f$, can be stochastic.
the reward function itself can be stochastic.

The last two assumptions are not critical – for example, in simple games, the dynamics and the reward are deterministic functions; but not so in more complex environments, such as the stock market – but the first one (stochastic policies) is fundamental in RL. This also hints at why we are optimizing over probabilistic policies in the first place: if there were no uncertainty and everything was deterministic, an oracle could have designed an optimal sequence of actions for all time beforehand.

(In older Atari-style or Nintendo video games, this could indeed be done and one could play an optimal game pretty much from memory: YouTube has several examples of folks playing games like Super Mario blindfolded.)

Since policies are probabilistic, they induce a probability distribution over trajectories, and hence the cumulative negative reward is also probabilistic. (It’s a bit hard to grasp this, considering that all the loss functions that we have talked about until now in deep learning have been deterministic, but the math works out in a similar manner.) So to be more precise, we will need to rewrite the loss in terms of the expected value over the randomness:

\[\begin{aligned} \text{minimize}~&\mathbb{E}_{\pi(\tau)} [R(\tau)] = \mathbb{E}_{\pi(\tau)} \left[ \sum_{t=0}^{L-1} - r(s_t, a_t) \right], \\ \text{subject to}~&s_{t+1} = f(s_t,a_t) \\ & a_t = \pi(\tau_t),~\text{for}~t = 0,\ldots,L-1. \end{aligned}\]

This probabilistic way of thinking makes the role of ML a bit more clear. Suppose we have a yet-to-be-determined policy $\pi$. We pick a horizon length $L$, and execute this policy in the environment (the game engine, a simulator, the real world, ...) for $L$ time steps. We get to observe the full trajectory $\tau$ and the sequence of rewards $r(s_t,a_t)$ for $t=0,\ldots,L-1$. This pair is called a training sample. Because of the randomness, we simulate multiple such rollouts, and compute the cumulative reward averaged over all such rollouts, and adjust our policy parameters until this expectation is maximized.
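The following sketch shows what "simulate multiple rollouts and average" looks like in code. The environment, reward, and policy below are toy assumptions of ours (a counter that we would like to push upwards), and for simplicity the policy only looks at the current state rather than the full trajectory.

```python
import random


def rollout(policy, f, r, s0, L, rng):
    """Run `policy` for L steps; return the trajectory and the loss R(tau)."""
    s, traj, loss = s0, [], 0.0
    for _ in range(L):
        a = policy(s, rng)
        traj.append((s, a))
        loss -= r(s, a)      # negative reward, as in the objective above
        s = f(s, a)          # environment dynamics
    return traj, loss


# Hypothetical toy problem.
f = lambda s, a: s + a                                    # state transition
r = lambda s, a: float(a)                                 # reward +1 or -1
policy = lambda s, rng: 1 if rng.random() < 0.8 else -1   # stochastic policy

rng = random.Random(0)
losses = [rollout(policy, f, r, 0, 10, rng)[1] for _ in range(2000)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)  # a Monte Carlo estimate of E[R(tau)]; the exact value is -10 * (0.8 - 0.2) = -6
```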

We now return to the first sentence of this subsection: why RL is “in-between” supervised and unsupervised learning. In supervised learning we need to build a function that predicts label $y$ from data features $x$. In unsupervised learning there is no separate label $y$; we typically wish to predict some intrinsic property of the dataset of $x$. In RL, the “label” is the action at the next time step, but once taken, this action becomes part of the training data and influences the subsequent action. This issue of intertwined data and labels (due to the possibility of complicated feedback loops across time) makes RL considerably more challenging.

Policy gradients

Let us now discuss a technique to numerically solve the above optimization problem. Basically, it will be a smart version of ‘trial-and-error’ – sample a rollout with some actions; if the reward is high then make those actions more probable (i.e., “reinforce” these actions), and if the reward is low then make those actions less probable.

In order to maximize expected cumulative rewards, we will need to figure out how to take gradients of the reward with respect to the policy parameters.

Recall that trajectories/rollouts $\tau$ are a probabilistic function of the policy parameters $\theta$. Our goal is to compute the gradient of the expected reward, $\mathbb{E}_{\pi(\tau)} R(\tau)$ with respect to $\theta$. To do so, we will need to take advantage of the log-derivative trick. Observe the following fact:

\[\begin{aligned} \frac{\partial}{\partial \theta} \log \pi(\tau) &= \frac{1}{\pi(\tau)} \frac{\partial \pi(\tau)}{\partial \theta},~\text{i.e.} \\ \frac{\partial \pi(\tau)}{\partial \theta} &= \pi(\tau) \frac{\partial}{\partial \theta} \log \pi(\tau) . \end{aligned}\]

Therefore, the gradient of the expected reward is given by:

\[\begin{aligned} \frac{\partial}{\partial \theta} \mathbb{E}_{\pi(\tau)} R(\tau) &= \frac{\partial}{\partial \theta} \sum_{\tau} R(\tau) \pi(\tau) \\ &= \sum_\tau R(\tau) \frac{\partial \pi(\tau)}{\partial \theta} \\ &= \sum_\tau R(\tau) \pi(\tau) \frac{\partial}{\partial \theta} \log \pi(\tau) \\ &= \mathbb{E}_{\pi(\tau)} [R(\tau) \frac{\partial}{\partial \theta} \log \pi(\tau)]. \end{aligned}\]
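We can sanity-check this identity numerically. The sketch below assumes the simplest possible setting (our own choice): a "trajectory" is a single outcome drawn from a softmax distribution with logits $\theta$, and $R$ is an arbitrary table of values. It compares the score-function form of the gradient with a finite-difference estimate.

```python
import math


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def expected_R(theta, R):
    return sum(p * r for p, r in zip(softmax(theta), R))


def score_function_grad(theta, R):
    """E_pi[ R(tau) * d/dtheta log pi(tau) ], summed exactly over all tau."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for tau, (p, r) in enumerate(zip(pi, R)):
        for k in range(len(theta)):
            dlogp = (1.0 if k == tau else 0.0) - pi[k]   # d log pi(tau) / d theta_k
            grad[k] += p * r * dlogp
    return grad


def finite_diff_grad(theta, R, h=1e-6):
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + h] + theta[k + 1:]
        dn = theta[:k] + [theta[k] - h] + theta[k + 1:]
        grad.append((expected_R(up, R) - expected_R(dn, R)) / (2 * h))
    return grad


theta, R = [0.1, -0.3, 0.5], [1.0, 2.0, -1.0]   # hypothetical values
print(score_function_grad(theta, R))
print(finite_diff_grad(theta, R))                # the two should agree closely
```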

So in words, the gradient of an expectation can be converted into an expectation over a closely related quantity. So instead of computing this expectation exactly, we sample different rollouts (as in SGD) and compute a stochastic approximation to the gradient. The entire pseudocode is as follows.

Repeat:

Sample a trajectory/rollout $\tau = (s_0, a_0, s_1, \ldots, s_L)$.
Compute $R(\tau) = \sum_{t=0}^{L-1} - r(s_t, a_t)$
$\theta \leftarrow \theta - \eta R(\tau) \frac{\partial}{\partial \theta} \log \pi(\tau)$

There is a slight catch here, since we are reinforcing actions over the entire rollout; however, actions should technically be reinforced only based on future rewards (since they cannot affect past rewards). But this can be adjusted by suitably redefining $R(\tau)$ in Step 2 to sum over the $t^{th}$ time step until the end of the horizon.

That’s it! This form of policy gradient is sometimes called REINFORCE. Since we are sampling rollouts, this is also called Monte Carlo Policy Gradient.
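Here is a minimal sketch of REINFORCE on a deliberately tiny problem of our own making: each rollout is a single action drawn from a softmax policy with parameters $\theta$, and only one of the three actions gives a reward. The reward table, step size, and number of iterations are illustrative assumptions. We keep the convention used above: $R(\tau)$ is the negative reward, and we take a descent step.

```python
import math
import random

REWARDS = [0.0, 0.0, 1.0]   # hypothetical one-step task: only action 2 pays off


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def reinforce(steps=2000, eta=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choices(range(3), weights=pi)[0]   # sample a rollout (one action)
        R = -REWARDS[a]                            # cumulative negative reward
        # For a softmax policy: d/dtheta_k log pi(a) = 1[k == a] - pi[k].
        grad_logp = [(1.0 if k == a else 0.0) - pi[k] for k in range(3)]
        # theta <- theta - eta * R(tau) * d/dtheta log pi(tau)
        theta = [t - eta * R * g for t, g in zip(theta, grad_logp)]
    return softmax(theta)


pi = reinforce()
print(pi)  # most of the probability mass should end up on action 2
```

Note that the reward is only ever evaluated, never differentiated; the only derivative we need is that of $\log \pi$.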

First, notice that in the above algorithm we never require direct access to the environment (or more precisely, the model of the environment, $f$) – only the ability to sample rollouts, and the ability to observe corresponding rewards. This setting is therefore called model-free reinforcement learning. A parallel set of approaches is model-based RL, which we will briefly touch upon next week.

Second, notice that since we don’t require gradients, this works even for non-differentiable reward functions! In fact, the reward can be anything – non-smooth, non-differentiable, even discontinuous (such as a 0-1 loss).

Connection to random search

In the above algorithm, in order to optimize over rewards, observe that we only needed to access function evaluations of the reward, $R(\tau)$, but never its gradient. This is a departure from the regular gradient-based backpropagation framework we have been using thus far. The REINFORCE algorithm is in fact an example of derivative free optimization, which involves optimizing functions without gradient calculations.

Another way to do derivative free optimization is simple: just random search! Here is a quick introduction. If we are minimizing any loss function $f(\theta)$, recall that gradient descent updates $\theta$ along the negative direction of the gradient: $\theta \leftarrow \theta - \eta \nabla f(\theta) .$

But in random search, we pick a random direction $v$ to update $\theta$, and instead search for the (scalar) step size that provides maximum decrease in the loss along that direction. This is a rather inefficient way to minimize a loss function (for the same intuition that if we are trying to walk to the bottom of a valley, it is much better to follow the direction of steepest descent, rather than bounce around randomly.) But in the long run, random search does provably work as well. The pseudocode is as follows:

Sample a random direction $v$
Search for the step size (positive or negative) that minimizes $f(\theta + \eta v)$. Let that step size be $\eta_{\text{opt}}$.
Set $\theta \leftarrow \theta + \eta_{\text{opt}} v$.

Again, observe that the gradient of $f$ never shows up! The only catch is that we need to do a step size search (also called line search). However, this can be done quickly using a variation of binary search. Notice the similarity of the update rules (at least in form) to REINFORCE.
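A minimal sketch of random search is below. The loss is a made-up quadratic, and for the line search we assume the loss is unimodal along each direction and use ternary search over a fixed interval (one possible "variation of binary search"; other line-search methods would work too).

```python
import random


def line_search(g, lo=-10.0, hi=10.0, iters=60):
    """Ternary search for the minimizer of a unimodal 1-D function g on [lo, hi]."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def random_search(f, theta, iters=200, seed=0):
    rng = random.Random(seed)
    for _ in range(iters):
        v = [rng.gauss(0, 1) for _ in theta]              # random direction
        # Step size (positive or negative) minimizing f(theta + eta * v).
        eta = line_search(lambda e: f([t + e * x for t, x in zip(theta, v)]))
        theta = [t + eta * x for t, x in zip(theta, v)]
    return theta


f = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2    # hypothetical loss, minimum at (1, -2)
theta = random_search(f, [0.0, 0.0])
print(theta, f(theta))
```

Only evaluations of `f` are used; no gradient appears anywhere.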

Let us apply this idea to policy gradients. Instead of the log-derivative trick, we will simply assume deterministic policies (i.e., a particular choice of policy parameters $\theta$ leads to a deterministic rollout $\tau$) and use the above algorithm, with $f$ being the cumulative negative reward $R$. The overall algorithm for policy gradient now becomes the following.

Repeat:

Sample a new policy update direction $v$.
Search for the step size $\eta$ that minimizes $R(\theta + \eta v)$.
Update the policy parameters $\theta \leftarrow \theta + \eta v$.

Done!

Details and extensions

We have only touched upon the bare minimum required to understand policy gradients in RL. This is a very vast area of emerging work and we cannot unfortunately do justice to all of it. Let us touch upon some practical aspects/concerns that may be of importance while trying to build RL systems.

First, the problem with REINFORCE is that we are replacing the expected value with a sample average in the gradient calculation, but unlike in standard SGD-type training, the variance of the sample average will be typically too high. This means that vanilla policy gradients will be far too slow and unreliable.

The standard solution is to perform variance reduction. One way to adjust the variance is via insertion of a quantity called the reward baseline. To understand this, observe that unlike regular gradient descent type training methods (which by definition depend on the slope/gradient of the loss), REINFORCE depends on the absolute value, not the change, of the reward function $R(\tau)$. This does not quite make sense: if a constant bias (of say +1000) is added uniformly to the reward function, the problem does not change fundamentally (we are just rewriting the reward on a different scale) but the algorithm changes quite a bit: in every iteration, every set of weights is likely to be reinforced positively no matter whether the action taken was good or bad.

A simple fix is baseline-adjusted descent: subtract a baseline $b$ from the reward function, i.e., use $R(\tau) - b$ in place of $R(\tau)$. Here is the method: we learn a baseline such that good actions are always associated with positive reward, and bad actions are associated with negative reward. This is hard to do properly, and it is important to re-fit the baseline estimate each time. In the discounted reward case, we have to re-adjust the baseline depending on $\gamma$.
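The effect of a baseline can be checked exactly in a small example. The sketch below assumes a one-step problem with a softmax policy over three actions, with rewards shifted by a large constant (all numbers are made up), and uses the simplest baseline we can think of, $b = \mathbb{E}[R]$, rather than a learned one. It computes the exact mean and variance of the one-sample gradient estimate (for the first coordinate of $\theta$) with and without the baseline.

```python
import math


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]


def grad_mean_and_var(theta, rewards, b):
    """Exact mean/variance of the estimate (R - b) * d log pi(a) / d theta_0."""
    pi = softmax(theta)
    terms = []
    for a, p in enumerate(pi):
        R = -rewards[a]                                   # loss = negative reward
        dlogp = (1.0 if a == 0 else 0.0) - pi[0]
        terms.append((p, (R - b) * dlogp))
    mean = sum(p * g for p, g in terms)
    var = sum(p * (g - mean) ** 2 for p, g in terms)
    return mean, var


theta = [0.0, 0.0, 0.0]
rewards = [1003.0, 1000.0, 1002.0]                        # shifted by a large constant
baseline = -sum(p * r for p, r in zip(softmax(theta), rewards))   # b = E[R]

mean0, var0 = grad_mean_and_var(theta, rewards, b=0.0)
mean_b, var_b = grad_mean_and_var(theta, rewards, b=baseline)
print(mean0, var0)     # same mean ...
print(mean_b, var_b)   # ... but a far smaller variance
```

The mean is unchanged because $\mathbb{E}_{\pi}[\partial \log \pi / \partial \theta] = 0$, so subtracting a constant from $R$ does not bias the gradient.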

Another point in policy gradients is that we do not require a differentiable reward/loss, but we do require that the mapping $\pi$ from trajectories to actions is differentiable. That’s the only way we can properly define $\partial \log \pi$ in the policy gradient update step (and that’s where standard neural net training methods such as backprop enter the picture).

To fix this, there is a class of techniques in RL called Evolutionary search (ES) that removes backprop entirely. The idea is to define the choice of policy itself as probabilistic functions (so $\pi$ itself can be viewed as being drawn from a distribution over functions) and apply the log-derivative trick there. It’s a bit complicated (and the gains over policy gradient are somewhat questionable) so we will not discuss this in detail here.

Lecture 8: Applications in NLP

2021-03-09T00:00:00+00:00

In which we see the power of deep networks in natural language processing.

In our discussion on deep learning for text, we have mainly focused on the middle part of this picture:

All neural network models assume real-valued vector inputs, and we have assumed that there is some magical way to convert discrete data (such as text) to a form that neural networks can process.

Today we will focus on the bottom part. Where do the word encodings come from? And how do they interact with the rest of the learning?

word2vec

The easiest way to encode words/tokens into real-valued vectors is one that we have already used for image-classification type applications: one-hot encoding.

Pros: this is dead simple to understand and implement.

Cons: there are two major drawbacks of using one-hot encodings.

Each encoding can become very high dimensional. By definition, the encoded vectors are now the size of the vocabulary/dictionary. At the character level this is fine; at the word level it becomes very difficult; and at any higher level the space of symbols becomes combinatorially large.
More than just computation: one-hot encodings do not capture semantic similarities (every word is equally far in L1/L2/Hamming distance from every other word). It would be nice to have similar words share similar features (where the meaning of “similar” depends on the language and/or context).
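The second drawback is easy to verify. In the sketch below (with a made-up four-word vocabulary), every pair of distinct one-hot vectors is at exactly the same distance, so "cat" is no closer to "dog" than it is to "truck".

```python
import math


def one_hot(i, n):
    return [1.0 if k == i else 0.0 for k in range(n)]


def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


vocab = ["cat", "dog", "car", "truck"]                    # hypothetical toy vocabulary
codes = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}
dists = {(a, b): l2(codes[a], codes[b]) for a in vocab for b in vocab if a < b}
print(dists)  # every pair is at L2 distance sqrt(2)
```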

This was recognized by early NLP researchers. In the mid-2000s, a host of encoding methods were proposed, including latent semantic analysis (LSA) and singular value decomposition (SVD). All of these were superseded by Word2vec, which came up in the early 2010s.

Word2vec is a word encoding framework that uses one of two approaches: skip-grams and continuous bag-of-words.

Skip-grams

In skip-grams, each word has two vector embeddings: $v_i$ and $u_i$. Let us first motivate why we need two such embeddings. As a running example, we will keep in mind a sentence such as:

“It is raining right now”.

and imagine it being represented as a sequence of words $x_1, x_2, x_3, x_4, x_5$.

We already discussed $n$-grams briefly before while motivating language models. For $n=2$, these are the joint probabilities

\[P(x_1,x_2), P(x_2,x_3), ...,\]

each of which can be empirically calculated by counting the number of co-occurrences of pairs of words in a database. Equivalently, it is easier to express this in terms of conditional probabilities

\[P(x_2 | x_1), P(x_3 | x_2), ...\]

The term “skip-gram” comes from the fact that we consider conditional probabilities that are non-consecutive, i.e., words can be skipped over. (The reason for exploring relationships between non-consecutive words goes back to the non-local, long-range dependency structure of natural languages.) In this case, the factorization is done with respect to the “center” (or “target”) word of the sequence; the other words are called “context” words. So the above factorization becomes:

\[P(x_1 | x_3) \cdot P(x_2 | x_3) \cdot P(x_4 | x_3) \cdot P(x_5 | x_3)\]

Intuitively, these probabilities tell us: “if a word $x_i$ appears in a sentence, how likely is it that the word $x_j$ will appear in its vicinity?” Here, “vicinity” would mean a window of some fixed size.

Having defined non-local conditional probabilities, the algorithmic question now becomes: how to estimate them? Again, one could just use the frequency of co-occurrence counts in some large text corpus. However, we will depart from the standard approach, and instead train a simple neural network that predicts

\[P(x_j | x_i).\]

The network will be two layers deep (i.e., a single hidden layer of neurons with linear activations), followed by a softmax.

Some more details about this network. Say we have a dictionary of $N$ words. The input is a one-hot encoding $x_i$ of any given word $i$ (so, $N$ input neurons). The output is a vector of pre-softmax logits (so, $N$ output neurons). We can imagine (say) a hidden layer of $d$ (linear) neurons. So if we call $V \in \mathbb{R}^{d \times N}$ and $U \in \mathbb{R}^{d \times N}$ the two layers, then the conditional probability of any context word given the center is given by:

\[\begin{aligned} P(x_j | x_i) &= \text{softmax}(U^T V x_i) \\ &= \text{softmax}(U^T v_i ) \\ &= \text{softmax}([u_1^T v_i; u_2^T v_i; \ldots u_N^T v_i]) . \end{aligned}\]

So examining the columns of $V$ and $U$ gives us precisely what we want – the word embeddings for the target and the context words respectively. Typically, $d \ll N$, so the embedding dimension is much smaller than the size of the vocabulary.

Using the columns of $U$ and $V$ as embeddings also intuitively makes sense: similar words/synonyms should give us similar output probabilities, and in order for two outputs to be similar, both target and context probabilities must match.
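Here is a sketch of the forward pass of this network. We store the embeddings as lists of vectors, so that `V[i]` plays the role of $v_i$ and `U[j]` plays the role of $u_j$; the vocabulary size, embedding dimension, and random initialization are arbitrary choices for illustration.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def skipgram_probs(U, V, i):
    """P(x_j | x_i) for every j, with V[i] = v_i (center) and U[j] = u_j (context)."""
    v_i = V[i]   # multiplying by a one-hot x_i just selects the i-th embedding
    logits = [sum(a * b for a, b in zip(u_j, v_i)) for u_j in U]
    return softmax(logits)


rng = random.Random(0)
N, d = 6, 3      # hypothetical vocabulary size and embedding dimension
V = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(N)]
U = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(N)]
probs = skipgram_probs(U, V, i=2)
print(probs)     # one probability per word in the vocabulary
```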

How do we train this network? First, we need to define a loss function. We can just use the standard cross-entropy loss, where the network is fed pairs of words (one-hot encoded) as data-label pairs. So for a particular pair of target-context words $i$ and $j$, we get the loss term:

\[l(i,j) = -u_j^T v_i + \log\left(\sum_{k=1}^{N} \exp(u_k^T v_i)\right)\]

whose derivative can then be used to update all the weights.
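For concreteness, here is a sketch of one SGD step on the cross-entropy loss for a single (target, context) pair. With $p_k$ denoting the softmax probability of word $k$, the gradients of this loss are $\sum_k p_k u_k - u_j$ with respect to $v_i$ and $(p_k - 1[k=j])\, v_i$ with respect to each $u_k$. The sizes, initialization, and learning rate are illustrative assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss_and_probs(U, V, i, j):
    logits = [dot(u, V[i]) for u in U]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    probs = [math.exp(z - log_z) for z in logits]
    return log_z - logits[j], probs      # cross-entropy loss for the pair (i, j)


def sgd_step(U, V, i, j, lr=0.1):
    _, p = loss_and_probs(U, V, i, j)
    v_old = V[i][:]
    # Gradient w.r.t. v_i: sum_k p_k u_k - u_j (computed before U changes).
    grad_v = [sum(p[k] * U[k][c] for k in range(len(U))) - U[j][c]
              for c in range(len(v_old))]
    # Gradient w.r.t. u_k: (p_k - 1[k == j]) v_i; note this touches every u_k.
    for k in range(len(U)):
        coeff = p[k] - (1.0 if k == j else 0.0)
        U[k] = [u - lr * coeff * v for u, v in zip(U[k], v_old)]
    V[i] = [v - lr * g for v, g in zip(v_old, grad_v)]


rng = random.Random(0)
N, d = 5, 3
U = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(N)]
V = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(N)]
before, _ = loss_and_probs(U, V, 1, 3)
sgd_step(U, V, 1, 3)
after, _ = loss_and_probs(U, V, 1, 3)
print(before, after)   # the loss on this pair should go down
```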

There are a few more issues to be considered here. First, the network is large: in English, for example, there are about $N = 10K$ commonly used words, so with an embedding dimension of (say) $d = 300$ we already have approximately $2Nd = 6M$ weights to learn. Second, training can be extremely slow, since for every sample pair we have to touch all the weights. The word2vec paper did a few extra hacks (hierarchical softmax, negative sampling) to make this work, which we won’t dive into here – more details in an NLP course perhaps. See Chapter 14 of the textbook if you are interested.

Continuous Bag of Words (CBOW)

The CBOW model is very similar to the skip-gram model, so we won’t get into too much detail. The main difference is that the CBOW model flips things around: instead of the center word defining the context, the context words are used to predict the target. So the conditional probabilities become:

\[P(x_3 | x_1, x_2, x_4, x_5)\]

which cannot be easily factorized the way we did above. But the expression remains similar, if we approximate the embedding of the context as the (vector) average of the individual embeddings:

\[P(x_i | x_1, \ldots x_j \ldots) = \frac{\exp(u_i^T \text{Ave}(v_j))}{\sum_k \exp(u_k^T \text{Ave}(v_j))} .\]

Given this approximation of the conditional probabilities, the training is done just the same way as described above using the cross-entropy loss.
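A sketch of the CBOW probability computation is below. The two-dimensional embeddings are hand-picked toy values, chosen only so that the result is easy to check by hand.

```python
import math


def cbow_probs(U, V, context):
    """P(x_i | context) for every i, using the average of the context embeddings."""
    d = len(V[0])
    avg = [sum(V[j][c] for j in context) / len(context) for c in range(d)]
    logits = [sum(u * a for u, a in zip(U[i], avg)) for i in range(len(U))]
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    s = sum(e)
    return [x / s for x in e]


# Hypothetical 2-dimensional embeddings for a 3-word vocabulary.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[2.0, 2.0], [-1.0, 0.0], [0.0, -1.0]]
probs = cbow_probs(U, V, context=[0, 1])
print(probs)  # Ave(v_j) = (0.5, 0.5), logits = (2, -0.5, -0.5): word 0 is most likely
```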

Which embedding is better? Both are roughly equivalent and we could use one or the other.

GloVe

The main problem with the word2vec framework is that both skip-gram and CBOW models rely on predicting output probabilities, and hence have to be trained with the cross-entropy loss.

For very large dictionaries, calculating cross-entropy can be troublesome: each gradient update requires computing softmaxes (and hence calculating all the outputs and marginalizing over them). Global vector (GloVe) embeddings resolve this in a slightly different manner. The idea is to use matrix factorization (a la PCA), and since it is not neural network-based we won’t go into too much detail here: take an NLP class if interested. The main steps are as follows:

We construct a word-context co-occurrence matrix and try to factorize it using PCA (i.e., find the low-rank decomposition that minimizes the reconstruction loss).
Not trivial, but this is a very sparse matrix! Can train using SGD type methods.
Word distributions have a long tail, so very common words will dictate the loss function. To make things more equitable, log-probabilities are used.
In practice, a modified weighted form of the reconstruction loss is used:

\[L(U,V,b,c) = \sum_{i,j} f(x_{ij}) (u_i^T v_j + b_i + c_j - \log x_{ij})^2\]

where $f(x_{ij}) = 1$ for reasonable $x_{ij}$ but quickly goes to 0 if $x_{ij}$ gets close to zero. This avoids the possibility that large (negative) values in the log-probabilities significantly influence the loss function.
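To make the weighted loss concrete, here is a small sketch that evaluates it on a sparse co-occurrence table. The particular weighting function (a power law capped at 1, with `x_max = 100` and exponent `0.75`) is the choice reported in the GloVe paper as far as we are aware; treat it, and all the toy numbers, as assumptions for illustration.

```python
import math

def glove_weight(x, x_max=100.0, power=0.75):
    """f(x): equals 1 for large counts and decays to 0 as x goes to 0."""
    return (x / x_max) ** power if x < x_max else 1.0

def glove_loss(cooc, U, V, b, c):
    """Weighted reconstruction loss over the nonzero co-occurrence entries.

    cooc : dict mapping (i, j) -> x_ij > 0 (sparse storage; zero entries
           carry zero weight, so they are simply skipped)
    U, V : lists of embedding vectors u_i and v_j
    b, c : lists of scalar biases b_i and c_j
    """
    total = 0.0
    for (i, j), x in cooc.items():
        pred = sum(p * q for p, q in zip(U[i], V[j])) + b[i] + c[j]
        total += glove_weight(x) * (pred - math.log(x)) ** 2
    return total

# Toy example: 2 words, 2-dimensional embeddings (made-up numbers).
cooc = {(0, 1): 10.0, (1, 0): 2.0, (1, 1): 250.0}
U = [[0.5, 1.0], [1.5, -0.5]]
V = [[0.2, 0.4], [1.0, 2.0]]
b = [0.1, 0.0]
c = [0.0, -0.1]
loss = glove_loss(cooc, U, V, b, c)
```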

ELMo, BERT, and GPT-2

While word2vec and GloVe represented step-changes in our ability to build sophisticated language models, they have now largely been surpassed by more modern techniques: ELMo, BERT, and GPT. Fortunately, we now have all the ingredients to understand them. The details are a bit hairy (a lot of engineering has gone into fine-tuning each of them) so we will only stick to high-level intuition and descriptions, while relegating the specifics to the textbook (Chapter 15).

ELMo

The main problem with GloVe/word2vec is that the word embeddings are context-independent. Recall that if we want to get a skip-gram embedding of a word, we one-hot encode it and look at the corresponding input- and output-layer weights in the above two-layer architecture.

However, words (particularly in languages such as English) are context-dependent. E.g. consider a sentence such as “Fish fish fish” – there are three identical words here, but the context shows that each has a rather different meaning. Can we somehow get embeddings that look not just at word-level semantics but also at how a word is used in a given sentence?

ELMo (Embeddings from Language Models) does this. Just as how we motivated RNNs/LSTMs as possible architectures that can capture context in a sequence of inputs, similarly we can replace the simple feedforward architecture of word2vec with recurrent architectures.

Specifically, ELMo proposes to produce word embeddings by looking at the entire sentence both left-to-right and right-to-left. It achieves this via bi-directional LSTMs: it looks at the hidden layer representations (states) for both the left-to-right and right-to-left LSTMs and takes a weighted linear combination of them as the word embedding for each word in the input. The weights are left as trainable parameters used by downstream tasks (such as classification or sentence prediction) for further fine-tuning.

The choice of loss function is important; ELMo uses next-word-prediction (NWP) as the task of choice using the cross-entropy loss.

BERT

The next natural progression was to replace the bidirectional LSTM encoding used by ELMo with Transformers. This led to BERT (bidirectional encoder representations from transformers). The main ingredients (over and above those described above) include:

Replacing LSTMs with transformer blocks. The output of each encoder layer in each token’s path can be viewed as a feature embedding of that token.

The loss function/training task used to learn the embeddings is next-sentence prediction (NSP), which is shown to be transferable to many tasks.
To encourage generalizability, a technique called masked language modeling is used: random words in a sentence are masked out, and the model is trained to predict them. The random masking is reminiscent of Dropout, which we have seen in the context of training feedforward nets.
BERT also uses word piece tokenization, which is somewhere in between character-level and word-level encoding. This is useful for languages like English. For example, the word “Walking” is broken into two pieces: “walk” and “ing”, each of which is tokenized.
There are two BERT models, which differ in the depth (number of encoder blocks) used in the Transformer architecture.
BERT is now adopted by Google Search in most of their supported languages.

GPT-2

This line of work culminated in GPT-2 (GPT = Generative Pre-Training). Its successor (GPT-3) is possibly the most advanced language model currently present, but is closed-source.

A key difference with BERT is that GPT-2 uses masked auto-regressive self-attention, so tokens are not allowed to peek at words to the right of them.

GPT-2 also used much deeper architectures than BERT, and was trained on a massive dataset (called the WebText corpus).

Other hacks: similar to BERT, GPT-2 uses subword tokenization. Specifically, it uses byte pair encoding (BPE), which borrows an idea from compression algorithms to figure out how to chop up regular words into tokens.

Summary

There you have it: a brief summary of modern neural architectures for NLP (and sequential data more broadly).

Among the many applications they support: apart from regular classification-type problems (such as sentiment analysis or named entity recognition), the above models support:

Language synthesis – as used by chatbots and the like.
Summarization: models such as GPT-2 can read a Wikipedia article (without the intro paragraph) and be asked to summarize the intro.
Similar architectures can be fine-tuned to perform music synthesis (such as synthetic midi file generation).

Lecture 7: Transformers

2021-03-08T00:00:00+00:00

In which we introduce the Transformer architecture and discuss its benefits.

Attention Mechanisms and the Transformer

Motivation

Attention models/Transformers are the most exciting models being studied in NLP research today, but they can be a bit challenging to grasp – the pedagogy is all over the place. This is both a bad thing (it can be confusing to hear different versions) and in some ways a good thing (the field is rapidly evolving, there is a lot of space to improve).

I will deviate a little bit from how it is explained in the textbook, and in other online resources: see Section 10 in the textbook for an alternative treatment.

Recall where we left off: general RNN models. They look like this:

[Figure: a general RNN model.]

We discussed some NLP applications that are suitable to be solved by RNNs. These include:

next symbol/token prediction
sequence classification

but there are several NLP applications for which RNN-type models are not the best. These include:

neural machine translation (NMT)
sentence generation

Consider, for example, the English sentence:

“How do you like the weather today?”

and its German translation:

“Wie finden sie das Wetter heute?”

While the two sentences are rather similar (both are Germanic languages), we find some subtle differences here. One is the difference in the number of words: the German version has one fewer word. The second is the order of the words – the pronoun “you” comes before the verb “like” in English but the pronoun “sie” comes after the verb “finden” in German. Both are examples of misalignment, and language translation frequently has to deal with small/local misalignments of this nature.

RNNs are not amenable to dealing with misalignments. The main reason is that RNNs (fundamentally) are sequence-to-symbol models: they output symbols one after the other based on the sequence seen so far. In NMT the outputs are not single tokens but sequences of tokens, each of which may depend on several parts of the input sequence (both forwards and backwards in time) with long-range dependencies. How do we fix this problem? Let us consider a few different solution approaches.

Attempt 1. Model tokens as entire sentences, not words (i.e., build the language model at the sentence level, not at the word- or character-levels). This, of course, is not feasible – due to combinatorial explosion, the number of possible sentences becomes extremely large very quickly.

Attempt 2. A second approach is to use bidirectional RNNs. The idea is simple: read the input sequence both backwards and forwards in time. This way we will get two sets of hidden states. We can concatenate both states to decode the output. This is fine, but still does not capture very long range dependencies.

Attempt 3. Encoder-decoder architectures. Delay producing any output in the beginning. Just compute the states recursively until the last state (which is the “global” context/memory variable which captures the entire sequence). This is called the encoder. Then feed it to the input again to produce outputs. This is called the decoder. This is a fine idea, but it suffers from the same issues as before: vanishing gradients, the limited ability of the final state to capture the overall context, etc.

Attempt 4. Why only the final state? Take all intermediate encoder states and store all of them as context vectors to be used by the decoder. This is getting better, but it is still too complex: there are encoder states, decoder states, decoder inputs, and so on. Also, it would be nice to figure out which parts of the input sequence influenced which other parts, so that we get a better understanding of the context. But how do we assign “influence scores” systematically?

Self-Attention

This is the point where papers-blogs-tweets-slides etc start talking about keys/values and attention mechanisms and everything goes a bit haywire. Let’s just ignore all that for now, and instead talk about something called self-attention. The use of the “self-” prefix will become clear later on.

Here is how it is defined. We have a set (not sequence, order does not matter right now) of input data points $\{x_1, x_2, \ldots, x_n\}$. They can all be $d$-dimensional vectors. We will produce a set of outputs $\{y_1, y_2, \ldots, y_n\}$, also $d$-dimensional vectors:

\[y_i = \sum_{j=1}^n W_{ij} x_j\]

i.e., each output is a weighted average of all inputs where the weights $W_{ij}$ are row-normalized such that they sum to 1.

Crucially, the weights here are not the same as the (learned) parameters in a neural network layer. Instead, they are derived from the inputs. For example, one option is that we choose the weights to be dot-products:

\[w_{ij} = x_i^T x_j\]

and apply the softmax function so that we get row-normalization:

\[W_{ij} = \frac{\exp(w_{ij})}{\sum_k \exp(w_{ik})}\]

and use these weights to construct the outputs. That’s basically self-attention in a nutshell. In fact, this is all we will need to understand transformers/BERT/GPT etc.

Notice a few fundamental differences between regular convnets/RNNs and the operation we discussed above:

Convnets map single inputs to single outputs. In self-attention, we map sets of inputs to sets of outputs, and by design, the interaction between data points is captured.
RNNs map inputs seen thus far to single outputs. In self-attention, we are not limited to tokens/symbols only seen in the past.
Until now, nothing is learnable here. This is an entirely deterministic operation with no free parameters. You can think of $x_i$ being features/embeddings that were learned “upstream” before being fed into the self-attention layer. We will add a few learnable parameters to the layer itself shortly.
Observe that the operation is permutation-equivariant: if I permute the order of $x$, the output of $y$ is exactly the same, but permuted. This can pose challenges in NLP where permutations may result in completely different meanings. We will fix this shortly.
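The parameter-free operation defined above fits in a few lines of plain Python. The sketch below is only meant to make the definition concrete (the function names and toy inputs are ours): it builds the dot-product weights, row-normalizes them with a softmax, and forms the outputs.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Basic self-attention with no learnable parameters.

    xs is a list of n input vectors, each a list of d floats.
    Returns the outputs ys and the row-normalized weight matrix W.
    """
    n, d = len(xs), len(xs[0])
    # Raw weights w_ij = x_i^T x_j.
    raw = [[sum(a * b for a, b in zip(xs[i], xs[j])) for j in range(n)]
           for i in range(n)]
    # Softmax over each row gives W_ij, so every row sums to 1.
    W = [softmax(row) for row in raw]
    # y_i = sum_j W_ij x_j.
    ys = [[sum(W[i][j] * xs[j][k] for j in range(n)) for k in range(d)]
          for i in range(n)]
    return ys, W

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, W = self_attention(xs)

# Permuting the inputs permutes the outputs in exactly the same way.
perm = [2, 0, 1]
ys_perm, _ = self_attention([xs[p] for p in perm])
```

Here `ys_perm[i]` matches `ys[perm[i]]` up to floating-point error, which is the permutation-equivariance property in action.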

Before we proceed, why does this operation even make sense?

One interpretation is as follows: suppose we restrict our attention to linear models (so the output has to be a linear combination of the inputs). Say we were performing an NMT task that was translating “The cat sat on the hat” from English to German. One could represent each word in this sentence with an embedding/token.

However, there is a lot of redundancy in natural languages. Certain words (the, on) are common words that are not informative/correlated. Other words (cat, hat) are similar (both nouns). Words may be grouped according to subject-object relationships or subject-predicate relationships. It would be useful if the model automatically “grouped” similar words together. That would allow both better context and better training. The dot product provides a mechanism for automatically figuring out this kind of grouping.

OK, now let’s generalize the self-attention operation a little bit.

In the above definition of the self-attention layer, observe that each data point $x_i$ plays three roles:

It is compared with all other data points to construct weights for its own output $y_i$ (i.e., in the dot-product example above, the sequence of weights $w_{i 1} = x_i^T x_1, w_{i 2} = x_i^T x_2, \ldots, w_{i n} = x_i^T x_n $).
It is compared with every other data point $x_j$ to construct weights for their output $y_j$ (i.e., the weights $w_{1i} = x_1^T x_i, w_{2i} = x_2^T x_i, \ldots$).
Once all the weights $w_{ij}$ have been constructed, it is used to finally synthesize each actual output $y_1, y_2, \ldots, y_n$.

These three roles are called the query, key, and value respectively. To make these roles distinct, let us add a few dummy variables:

\[\begin{aligned} q_i &= x_i, \\ k_i &= x_i, \\ v_i &= x_i \end{aligned}\]

and then write out the output as:

\[w_{ij} = q_i^T k_j, \qquad W_{ij} = \text{softmax}(w_{ij}), \qquad y_i = \sum_j W_{ij} v_j .\]

This is a lot of responsibility for each data point. Let’s make the life of each vector easier by adding learnable parameters (linear weights) for each of these three roles. For numerical reasons, we also scale the dot-product (this does not change intuition at all).

Therefore, we get:

\[\begin{aligned} q_i &= W_q x_i, \qquad k_i = W_k x_i, \qquad v_i = W_v x_i \\ w_{ij} &= q_i^T k_j / \sqrt{d}, \qquad W_{ij} = \text{softmax}(w_{ij}), \qquad y_i = \sum_j W_{ij} v_j . \end{aligned}\]

We can think of each of the $W_q$, $W_k$, $W_v$ as learnable projection matrices that define the roles of each data point.

One last complication. We can concatenate different self-attention mechanisms to give it more flexibility. This is analogous to choosing multiple filters in a convnet layer. This is called multi-head self-attention. We can index each head with $r = 1, 2, \ldots$, so that we get learnable parameters $W^r_q$, $W^r_k$, $W^r_v$. We get independent outputs for each head and then combine everything using a linear layer to produce the outputs. So we finally get:

\[\begin{aligned} q^r_i &= W^r_q x_i, \qquad k^r_i = W^r_k x_i, \qquad v^r_i = W^r_v x_i \\ w^r_{ij} &= \langle q^r_i, k^r_j \rangle / \sqrt{d}, \qquad W^r_{ij} = \text{softmax}(w^r_{ij}), \qquad y^r_i = \sum_j W^r_{ij} v^r_j, \\ y_i &= W_y \text{concat}[y^1_i, y^2_i, \ldots]. \end{aligned}\]

and there we have it. The entire (multi-head) self-attention layer. We will denote the above $x$-to-$y$ mapping as follows:

\[[y_1, y_2, \ldots, y_n] = \text{Att}([x_1, x_2, \ldots, x_n])\]
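The full multi-head layer can be sketched directly from these equations. This is a minimal, unoptimized version in plain Python; the random initialization, the helper names, and the choice of square $d \times d$ projections in every head (so that the scaling is exactly $\sqrt{d}$) are assumptions made for illustration.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(xs, Wq, Wk, Wv):
    """One head: project to queries/keys/values, then soft-match."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    scale = math.sqrt(len(xs[0]))  # sqrt(d), as in the equations above
    ys = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        ys.append([sum(w * v[c] for w, v in zip(weights, vs))
                   for c in range(len(vs[0]))])
    return ys

def multi_head_attention(xs, heads, Wy):
    """heads is a list of (Wq, Wk, Wv) triples; Wy mixes the concatenated heads."""
    per_head = [attention_head(xs, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return [matvec(Wy, [c for head in per_head for c in head[i]])
            for i in range(len(xs))]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, n_heads = 3, 2
heads = [tuple(random_matrix(d, d, rng) for _ in range(3)) for _ in range(n_heads)]
Wy = random_matrix(d, n_heads * d, rng)  # maps concat of heads back to d dims
xs = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3], [-1.0, 0.2, 0.0]]
ys = multi_head_attention(xs, heads, Wy)  # 4 output vectors of dimension 3
```

In practice the per-head projections are usually smaller than $d \times d$ and everything is batched into matrix multiplications, but the arithmetic is the same.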

Quick back-story on the nomenclature. These names query, key, value come from a key-value data structure. If we give a query key and match it to a database of available keys, then the data structure returns the corresponding matched value. The analogy is similar in attention mechanisms, except that the matching is done via dot-products (and the softmax ensures that it is a soft-matching, and every key in the database is matched to the query to some extent).

This also relates to the name “self-attention”. Recall our original discussion in the beginning of this lecture when we discussed encoder/decoder architectures. We had recurrent neural networks taking the input $\{x_i\}$ and doing complicated things to get encoder context vectors $\{h_i\}$ and decoder states $s_i$. Then we were computing “influence scores” to figure out which words were relevant for (or “attend to”) which output. One mechanism proposed for doing this was to compute dynamic context scores:

\[c_i = \sum_{j} \alpha_{ij} h_j\]

where $\alpha$ represented the alignment weights. This was called an attention mechanism, and early NMT papers used a shallow feedforward network (called an attention layer) to compute these alignment weights:

\[\alpha_{ij} = W_1 \text{tanh}(W_2 [h_j, s_i])\]

followed by a softmax. Notice the similarities between what we discussed so far and the above formulation. A seminal paper in 2017 called “Attention is all you need” dramatically simplified things and showed that self-attention is enough – you could interpret contexts quite well in NLP tasks if we just let the input data tokens attend to themselves.

Transformers

We now use the self-attention layer described above to build a new architecture called the Transformer. The Transformer architecture now forms the backbone of the most powerful language models yet built, including BERT and GPT-2/3.

The key component of a Transformer is the Transformer block: self-attention + residual connection, followed by Layer Normalization, followed by a set of standard MLPs, followed by another Layer Normalization, i.e., something like this:

Observe that this architecture is completely feedforward, with no recurrent units. Therefore, gradients do not vanish/explode (by construction), and the depth of the network is no longer dictated by the length of the input (unlike RNNs). Multiple transformer blocks can then be put together to form the transformer architecture.
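A Transformer block along these lines can be sketched as follows. This is a simplified, single-head version in plain Python, and several details are assumptions on our part: the layer norm has no learnable scale/shift, the MLP is one hidden ReLU layer, and we add a residual connection around the MLP as well (as is done in the standard design).

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def self_attention(xs, Wq, Wk, Wv):
    qs, ks, vs = ([matvec(W, x) for x in xs] for W in (Wq, Wk, Wv))
    scale = math.sqrt(len(xs[0]))
    out = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        out.append([sum(w * v[c] for w, v in zip(weights, vs))
                    for c in range(len(vs[0]))])
    return out

def transformer_block(xs, Wq, Wk, Wv, W1, W2):
    att = self_attention(xs, Wq, Wk, Wv)
    # Self-attention + residual connection, followed by layer normalization.
    hs = [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(xs, att)]
    out = []
    for h in hs:
        # MLP applied to each token separately: linear -> ReLU -> linear.
        hidden = [max(0.0, u) for u in matvec(W1, h)]
        mlp = matvec(W2, hidden)
        # Second residual connection (assumption) and second layer norm.
        out.append(layer_norm([a + b for a, b in zip(h, mlp)]))
    return out

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_hidden = 4, 8
params = [random_matrix(d, d, rng) for _ in range(3)]
params += [random_matrix(d_hidden, d, rng), random_matrix(d, d_hidden, rng)]
xs = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.2], [0.3, -0.3, 0.3, 0.9]]
# Blocks have matching input/output shapes, so they can be stacked.
ys = transformer_block(transformer_block(xs, *params), *params)
```

The number of stacked blocks is a free design choice here; it does not depend on the sequence length `len(xs)`.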

Transformers: Wrapup

One part that we didn’t emphasize too much in the previous lecture is the fact that unlike sequence models (such as RNNs or LSTMs), self-attention layers are permutation-equivariant. This means that sentences of the form:

“Jack gave water to Jill”

and

“Jill gave water to Jack”

will learn the exact same features. In order to incorporate positional information, some more effort is needed.

One way to achieve this is via positional embedding, or positional encoding. We create, in addition to the word embedding, a vector that encodes the location of the token. This vector can either be learned (just as word embeddings – see below) or just fixed. The latter is typically used in Transformer architectures.

What kind of positional encodings are useful? One-hot encoding the position is possible (although quickly becomes cumbersome – can you reason why this is the case?). Just adding an integer feature encoding the position is fine too, although we may run into scale/dynamic range issues, since the value of the feature can become very large for long sequences. A common approach is to use sinusoidal encoding:

\[p_t = [\sin(\omega_1 t); \sin(\omega_2 t); \ldots; \sin(\omega_d t)]\]

where $\omega_k = \frac{1}{10000^{k/d}}$ represents different frequencies. Thus the values of the positional encoding vector are always bounded, and because of the periodic nature of the definition this can be applied for any choice of $d$ and $t$.
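Here is a short sketch of the encoding exactly as written above (sines only). Note that the original Transformer paper interleaves sines and cosines at each frequency; the simplified form below follows the formula in these notes, and the function name is ours.

```python
import math

def positional_encoding(t, d):
    """p_t = [sin(w_1 t), ..., sin(w_d t)] with w_k = 1 / 10000^(k/d)."""
    return [math.sin(t / (10000.0 ** (k / d))) for k in range(1, d + 1)]

# Encodings for the first few positions of a sequence, with d = 8.
d = 8
encodings = [positional_encoding(t, d) for t in range(5)]

# Even for a very large position, every entry stays within [-1, 1].
far_away = positional_encoding(10**6, d)
```

The encoding vector is then typically added to (or concatenated with) the word embedding of the token at position $t$.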

Lecture 6: Recurrent Neural Nets

2021-03-07T00:00:00+00:00

In which we introduce deep networks for modeling time series data.

Recurrent Neural Networks

Thus far, we have mainly discussed deep learning in the context of image processing and computer vision. Let us now turn our attention to a different set of applications that involve text. For example, consider natural language processing (NLP), where the goal might be to:

perform document retrieval: used in database- and web-search;
convert speech (audio waveforms) to text: used in Siri, or Google Assistant;
achieve language translation: used in Google Translate;
map video to text: used in automatic captioning,

among a host of other applications.

Let us think about trying to use the tools we have developed so far to solve the above types of problems. Recall the kind of tools we have been using: thinking of data as real-valued vectors/arrays; representing entries of this array as nodes in a network; recursively applying arithmetic operations (organized in the form of layers); training the parameters of each layer; and so on.

Immediately we run into problems. For example, a document (or any other type of text object) is a string of characters, so how do we encode them into real-valued vectors? The naive approach would be to perform one-hot encoding of each character, just as how we encoded categorical labels in classification; but is this the best we can do? Should we instead try to model words, and if yes, then should we one-hot-encode words instead? Defining how to represent text is the first challenge.

Setting this question aside, a second challenge arises in the context of designing neural architectures for processing text data. If we think of representing the characters in a sentence as a linear vector/array, notice that the contents of the vector exhibit both short-range as well as long-range dependencies. The short-range dependencies encode relationships between characters in a word, or relationships between adjacent words; it is reasonable that one can capture this via a convnet.

But the long range dependencies are harder to model, and in a lot of languages the start of a sentence may have relevance to the end of a sentence. (Example: “The cow, in its full glory, jumped over the moon” – the subject and object are at two opposite ends of the sentence.) These kinds of non-local interactions are not easily captured by convnets, and therefore we need a new approach.

Markov and n-gram models

Before delving into neural nets for text processing, let us first discuss some classical methods. We will assume that text can be represented as a sequence of numerical symbols $w_1, w_2, \ldots$ where the symbols represent characters, words, or whatever model we define.

Classically, the tools to solve NLP problems were probabilistic language models. If we consider any sequence $w = (w_1,w_2,\ldots,w_T)$, then the goal would be to estimate the probability distribution:

\[P(w) = P(\{w_1, w_2, \ldots, w_T\})\]

From basic probability, we can factorize this distribution as:

\[P(w) = \prod_{t=1}^T P(\{w_t | w_{t-1}, w_{t-2}, \ldots w_1\})\]

So the likelihood of any given sequence appearing depends on the conditional probability of a word given the appearance of the previous several words.

These probabilities, in principle, can be empirically estimated given a very large corpus of training data. However, in practice such estimates can be noisy (or even intractable, given the combinatorial explosion in the number of possible word combinations). To alleviate this, it is typical to make the (first-order) Markov model assumption, which states that the likelihood of each word only depends on the previous word in the sentence:

\[P(w) = P((w_1,w_2,\ldots,w_T)) = P(w_1) \cdot P(w_2 | w_1) \cdot \ldots P(w_T | w_{T-1}) .\]

Now the conditional probabilities are relatively easier to estimate: if we have $n$ words in the dictionary then we need to estimate roughly $O(n^2)$ probabilities. This is large but not intractable.
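For instance, the conditional probabilities $P(w_t | w_{t-1})$ can be estimated by simply counting adjacent pairs in a corpus. The sketch below does this for a tiny made-up corpus; there is no smoothing, so pairs that never occur implicitly get probability zero.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(w_t | w_{t-1}) from counts of adjacent token pairs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {(prev, cur): count / prev_counts[prev]
            for (prev, cur), count in pair_counts.items()}

corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_probabilities(corpus)

# "the" is followed by "cat" twice and by "mat" once in this corpus.
p_cat_given_the = probs[("the", "cat")]
p_mat_given_the = probs[("the", "mat")]
```

With a dictionary of $n$ words the table `probs` can have up to $n^2$ entries, which is where the $O(n^2)$ estimate comes from.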

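The first-order Markov assumption unfortunately ignores dependencies across time beyond a single hop. If we were being brave, we could condition on the two, or three, or more previous words. Such models are called n-gram models: a model that conditions on the previous $k-1$ words is a $k$-gram model, so the first-order Markov model is a bigram model, and conditioning on two previous words gives a trigram model. But realize that as we introduce more and more dependencies across time, the number of probabilities to estimate quickly becomes very large.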

Recurrent architectures

An elegant way to resolve the time dependency issue and introduce long(er) range dependencies is via the notion of a latent variable called the state. We will rely on the following approximation:

\[P(\{w_t | w_{t-1}, w_{t-2}, \ldots w_1\}) \approx P(\{w_t | h_{t-1} \})\]

where $h_t$ is a hidden variable that approximately encodes all history up to the current instant. In general, we can assume that the state $h_t$ is a function of the previous state and the current input: $h_t = f(h_{t-1}, x_t)$.

Let us interpret this in the context of neural nets. Thus far, we have strictly used feedforward connections while discussing neural network architectures. Let us now introduce a new type of neural net with self-loops which acts on time series, called the recurrent neural net (RNN). In reality, the self-loops in the hidden neurons are computed with unit-delay, which really means that the state of the hidden unit at a given time step depends both on the input at that time step, and the state at the previous time step. The mathematical definition of the operations is as follows:

\[\begin{aligned} h^{t} &= \sigma(U x^{t} + W h^{t-1}) \\ y^{t} &= \text{softmax}(V h^{t}). \end{aligned}\]

So, historical information is stored in the output of the hidden neurons, across different time steps. We can visualize the flow of information across time by “unrolling” the network across time.

Observe that the layer weights $U, W, V$ are constant over different time steps; they do not vary. Therefore, the RNN can be viewed as a special case of deep neural nets with weight sharing.
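The unrolled forward pass can be sketched in a few lines. The nonlinearity $\sigma$ is not pinned down above, so the code below assumes a tanh; the weights and the one-hot toy inputs are made up.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, U, W, V, h0):
    """Run the RNN over a sequence; the same U, W, V are reused at every step."""
    h, hs, ys = h0, [], []
    for x in xs:
        # h^t = sigma(U x^t + W h^{t-1}), with sigma = tanh (assumption).
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        hs.append(h)
        # y^t = softmax(V h^t): a distribution over the vocabulary.
        ys.append(softmax(matvec(V, h)))
    return hs, ys

# Vocabulary of 3 symbols (one-hot inputs), hidden state of size 2.
U = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]      # 2 x 3
W = [[0.1, 0.7], [-0.5, 0.3]]                 # 2 x 2
V = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.8]]    # 3 x 2
xs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
hs, ys = rnn_forward(xs, U, W, V, h0=[0.0, 0.0])
```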

Loss functions and metrics

Let us recall our three-step recipe for machine learning. Having defined a model (or a representation), we now have to define a goodness of fit. For text, there are a couple of options. The training loss is typically chosen as the cross-entropy (recall that we are trying to approximate the probability of an output symbol/token given previous inputs). So if $y^t$ is the predicted output and $g^t$ is the one-hot encoding of the ground truth, then we can write out:

\[l(y^t, g^t) = - \sum_i g_i^t \log y_i^t = - \log y_{I(g)}^t\]

where $I(g)$ is the index corresponding to the true word, and the overall loss is given by averaging over the entire training corpus:

\[L(\theta) = \frac{1}{T} \sum_{t=1}^T l(y^t, g^t) = - \frac{1}{T} \sum_t \log y_{I(g)}^t (\theta) .\]

In practice, this can be very hard to compute for large datasets, so this is broken down into batches (usually sentences). There are additional complications while computing gradients which we discuss below.

Evaluation of a given model is done via a quantity called perplexity, which happens to be related to the loss that we defined above. Perplexity is an information-theoretic concept that measures how well a probability model predicts a given object/symbol. It is defined as the exponential of the cross-entropy of the final model measured over the predictions made over a validation dataset:

\[\text{Perplexity} = \exp \left( - \frac{1}{T} \sum_t \log y_{I(g)}^t \right)\]

If there is a lot of certainty about what the model is predicting, then the probability distribution is peaked around the right output, the cross-entropy is 0, and the perplexity is 1. If the model is spitting out random words, the probability distribution is likely going to be uniform and the perplexity is going to be equal to the number of tokens in the vocabulary (exercise: why is this?). A good prediction model achieves lower perplexities.
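A quick numerical sanity check of the two extreme cases, as a sketch: the function takes the probabilities that the model assigned to the true token at each time step (the name and the toy numbers are ours).

```python
import math

def perplexity(true_token_probs):
    """exp of the average negative log-probability of the true tokens."""
    T = len(true_token_probs)
    return math.exp(-sum(math.log(p) for p in true_token_probs) / T)

vocab_size = 50

# A model that always puts probability 1 on the right token.
confident = perplexity([1.0] * 10)

# A model that spreads its probability uniformly over the vocabulary.
uniform = perplexity([1.0 / vocab_size] * 10)

# Something in between.
mixed = perplexity([0.9, 0.5, 0.1, 0.7])
```

Here `confident` evaluates to 1 and `uniform` to the vocabulary size (up to floating-point error), with `mixed` somewhere in between.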

Backpropagation through time

Again, training an RNN can be done using the same tools as we have discussed before: variants of gradient descent via backpropagation. The twist in this case is the feedback loop, which complicates matters. To simplify this, we simply unroll the feedback loop into $T$ time steps, and perform backpropagation through time for this unrolled (deep) network. We need to be careful when we apply the multivariate chain rules while computing the backward pass, but really it is all about careful book-keeping; conceptually the algorithm is the same.

Here is a more concrete description of the backprop updates. Let’s just ignore all matrix-vector multiplies (the calculus becomes complex) and just pretend that everything (input, output, hidden state) is a scalar. There are three sets of weights we need to figure out: the weights mapping input to the state ($u$), the weights mapping the state to itself ($w$), and the weights mapping the state to the output ($v$).

Remember that these weights are constant across time, so even if we unroll the network out to $T$ steps, there is a massive amount of weight-sharing going on. The chain rule gives us:

\[\begin{aligned} \frac{\partial L}{\partial w} &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l^t}{\partial w} \\ &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l^t}{\partial y^t} \frac{\partial y^t}{\partial h^t} \frac{\partial h^t}{\partial w} \end{aligned}\]

The first and second factors above are easy to calculate (it’s just the derivative of the cross-entropy and the soft-max). However, the last term is tricky. By definition,

\[h^t = \sigma(u x^{t} + w h^{t-1}) := f(w, h^{t-1})\]

Therefore, the derivative of $h^t$ with respect to $w$ has two components:

\[\frac{\partial h^t}{\partial w} = \frac{\partial f(w, h^{t-1})}{\partial w} + \frac{\partial f(w, h^{t-1})}{\partial h^{t-1}} \cdot \frac{\partial h^{t-1}}{\partial w} .\]

If we define a sequence $a_t := \frac{\partial h^t}{\partial w}$, then each $a_t$ depends on $a_{t-1}$, which in turn depends on $a_{t-2}$, and so on. This induces a recurrence relation for $a_t$. So to accurately compute gradients with respect to $w$, we need to perform backprop all the way to the start of time. In practice this is far too cumbersome and we usually just truncate after a certain number of time steps.

(Observe that this problem did not come up in regular feed-forward networks – the gradients at any layer only depended on the forward pass activations and the backward pass messages at that layer).

Even more troubling is the fact that there is a multiplicative factor linking the terms $a_t$ and $a_{t-1}$. This has the effect of a geometric series: if the magnitude of the factor is greater than one on average across time, then the gradients explode, while if it is less than one on average across time, then the gradients vanish.
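The recurrence for $a_t$ is easy to check numerically in the scalar setting. The sketch below assumes $\sigma = \tanh$ (so that $\sigma'(z) = 1 - \tanh^2(z)$), uses made-up inputs and weights, and compares the recurrence against a finite-difference estimate of $\partial h^T / \partial w$.

```python
import math

def hidden_states(xs, u, w, h0=0.0):
    """Scalar RNN: h^t = tanh(u x^t + w h^{t-1}); returns [h^0, ..., h^T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(u * x + w * hs[-1]))
    return hs

def dh_dw(xs, u, w, h0=0.0):
    """a_t = dh^t/dw via the recurrence a_t = df/dw + (df/dh^{t-1}) a_{t-1}."""
    hs = hidden_states(xs, u, w, h0)
    a = 0.0  # h^0 is a constant, so it does not depend on w
    grads = []
    for t in range(1, len(hs)):
        slope = 1.0 - hs[t] ** 2     # tanh'(z^t)
        direct = slope * hs[t - 1]   # partial of f with respect to w
        carry = slope * w            # partial of f with respect to h^{t-1}
        a = direct + carry * a
        grads.append(a)
    return grads

xs = [0.5, -1.0, 0.3, 0.8]
u, w = 0.7, 0.9
analytic = dh_dw(xs, u, w)[-1]

# Central finite difference on the final hidden state, for comparison.
eps = 1e-6
numeric = (hidden_states(xs, u, w + eps)[-1]
           - hidden_states(xs, u, w - eps)[-1]) / (2 * eps)
```

The multiplicative factor discussed above is the `carry` term: its product across time steps is what makes the contribution of early time steps shrink or blow up.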

Stabilizing RNN training and extensions

Vanishing/exploding gradients are a major headache in deep learning, and are even more pertinent in RNNs (which, by design, require unrolling over several time steps). One way to solve this problem is called gradient clipping where we simply ignore the magnitude of the gradient and normalize it to some norm $\alpha$ that is kept constant:

\[g \leftarrow \alpha \frac{g}{\|g\|} .\]

As you can imagine this is sub-optimal since it may lead to erroneous gradient updates. But at least the numerics are stable.
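As a small sketch, here is the rescaling rule above next to the variant that is more commonly used in practice, where the gradient is only rescaled if its norm exceeds $\alpha$. The function names are ours.

```python
import math

def rescale_gradient(g, alpha):
    """Always rescale g to have norm alpha (the rule written above)."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm == 0.0:
        return list(g)  # nothing to rescale
    return [alpha * v / norm for v in g]

def clip_gradient(g, alpha):
    """Common variant: leave g alone unless its norm exceeds alpha."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= alpha:
        return list(g)
    return [alpha * v / norm for v in g]

big = [30.0, 40.0]    # norm 50
small = [0.3, 0.4]    # norm 0.5

rescaled_big = rescale_gradient(big, alpha=1.0)
clipped_big = clip_gradient(big, alpha=1.0)
clipped_small = clip_gradient(small, alpha=1.0)
```

Both rules preserve the direction of the gradient; they differ only in whether small gradients are left untouched.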

The alternative approach is to redesign the architecture itself. Notice the above example is for a single-layer RNN (which itself, let us be clear, is a deep network if we imagine the RNN to be unrolled over time). We could make it more complex, and define a multi-layer RNN by computing the mapping from input to state to output itself via several layers. The equations are messy to write down so let’s just draw a picture:

Depending on how we define the structure of the intermediate layers, we get various flavors of RNNs:

Gated Recurrent Units (GRU) networks
Long Short-Term Memory (LSTM) networks
Bidirectional RNNs

and many others. This gives us a lot of flexibility as to how to ensure that the gradient information propagates across several time steps.

LSTMs are the most well-known among the above architectures, but GRUs are a bit simpler to explain formally so let’s do that (refer to the textbook for LSTMs if you are interested). The idea is similar: we interpret the state as the memory of a recurrent unit, and hence would like to also somehow decide whether certain units are worth memorizing (in which case the state is updated), and others are worth forgetting (in which case the state is reset). Let us define two gating operations, called “reset” ($r$) and “update” ($z$):

\[r^t = \sigma(U_r x^t + W_r h^{t-1}), \qquad z^t = \sigma(U_z x^t + W_z h^{t-1})\]

which both look like a regular state-update equation. Now, ordinarily in an RNN, as we discussed in the beginning of this lecture, we would update the state as:

\[h^{t} = \sigma(U x^{t} + W h^{t-1}) .\]

But in the GRU, we define the candidate state as:

\[\tilde{h}^{t} = \sigma\left(U x^{t} + W (h^{t-1} \odot r^t)\right)\]

with the intuition being that if the reset gate is close to 1, then this looks like a regular RNN unit (i.e., we retain memory), while if the reset gate is close to 0, then this looks like a regular perceptron/dense layer (i.e., we forget).

Now, the update gate tells us how much memory retention versus forgetting needs to happen:

\[h^t = h^{t-1} \odot z^t + \tilde{h}^{t} \odot (1 - z^t) .\]

Whenever the update gate is close to one, we retain the old state; whenever it is close to zero, the state is over-written.
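Putting the four equations together, a single GRU step can be sketched as below. The gates use a logistic sigmoid so that they lie between 0 and 1; for the candidate state we assume a tanh (the equations above write $\sigma$ for both). Biases are omitted, and the parameter dictionary and toy numbers are our own.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x, h_prev, p):
    """One GRU update; p holds the matrices U_r, W_r, U_z, W_z, U, W."""
    # Reset and update gates, each between 0 and 1.
    r = sigmoid(add(matvec(p["U_r"], x), matvec(p["W_r"], h_prev)))
    z = sigmoid(add(matvec(p["U_z"], x), matvec(p["W_z"], h_prev)))
    # Candidate state: the reset gate decides how much old memory it sees.
    gated = [h * g for h, g in zip(h_prev, r)]
    h_tilde = [math.tanh(a) for a in add(matvec(p["U"], x), matvec(p["W"], gated))]
    # Update gate interpolates between the old state and the candidate.
    return [h * zt + c * (1.0 - zt) for h, zt, c in zip(h_prev, z, h_tilde)]

params = {
    "U_r": [[0.4, -0.2], [0.1, 0.3]], "W_r": [[0.2, 0.0], [0.0, 0.2]],
    "U_z": [[-0.3, 0.5], [0.6, -0.1]], "W_z": [[0.1, 0.1], [-0.1, 0.1]],
    "U": [[0.7, 0.2], [-0.4, 0.9]], "W": [[0.5, -0.5], [0.3, 0.3]],
}
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = gru_step(x, h, params)
```

Since the new state is a convex combination of the old state and a tanh output, it stays in $(-1, 1)$ as long as the initial state does.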