Intro
In the rapidly evolving landscape of image generation, diffusion-based approaches have surged in popularity. Models like Stable Diffusion, Imagen, and DALL-E 2 have captured attention for their remarkable capabilities. In this post, I will show how score-based generative models relate to these state-of-the-art systems. The discussion compares diffusion models with score-based generative models and shows how diffusion techniques fit naturally into the score-based framework. I will also explore how classifiers can guide the diffusion process, allowing image samples to be generated from specific class labels or textual inputs. After reading this post, you should have a working understanding of the principles behind these techniques and their potential applications. Many of the illustrations and equations here come from this post written by the primary author of the score-based generative modeling papers.
Generative Model History
In machine learning, models fall into two main categories: generative and discriminative. A model that determines whether an image shows a dog or a cat is discriminative, whereas a model that generates a lifelike image of a dog or cat is generative. Representative discriminative models include logistic regression, SVMs, random forests, and deep neural network classifiers. Typical generative models include autoencoders, GANs, and diffusion models.
The underlying principle of all generative models is to transform a simple distribution, commonly a Gaussian, into a more complex target distribution. This transformation reduces the task of drawing samples from the complex distribution to merely sampling from a Gaussian. For example, given a set of images, the distribution from which the image dataset originates serves as the target distribution. More precisely, if our simple base distribution is a standard Gaussian, the generative model learns a mapping that transports Gaussian samples into samples from the target data distribution.
Autoencoder
Autoencoders are an early and popular family of generative models. They comprise two components: an encoder, which maps input data into a latent space, and a decoder, which maps this latent representation back to an output resembling the original input. Autoencoders can also be viewed as a dimensionality reduction technique, since the latent space typically has far fewer dimensions than the input. This latent space acts as the architecture’s bottleneck, ensuring that only the most important information is passed through and subsequently reconstructed. However, traditional autoencoders encode inputs to arbitrary locations in the latent space, which makes it hard to generate realistic images by decoding points sampled from that space. To address these limitations, the Variational Autoencoder (VAE) was introduced.
The VAE’s innovation is to encourage the encoded data in the latent space to follow a standard normal distribution. This is accomplished by adding a KL-divergence term between the encoded distribution and a standard normal to the loss function. The loss thus has two components: a reconstruction loss (typically MSE, as with traditional autoencoders) and a KL-divergence loss. Such an approach encourages two properties: 1) nearby points in the latent space decode to similar outputs, and 2) any point sampled from the standard normal distribution decodes to a meaningful output. Nonetheless, VAEs often produce somewhat blurry results, likely due to the implicit interpolation between different images.
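As a rough sketch of what this loss looks like in practice (the function and variable names here are my own, not taken from any particular implementation), the two terms can be combined as follows, assuming the encoder outputs the mean and log-variance of the approximate posterior:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction term plus KL divergence to a standard normal prior."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```

The KL term pulls the encoded distribution toward the standard normal, while the reconstruction term keeps the decoder faithful to the input.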
GAN
Generative Adversarial Networks (GANs), introduced in 2014, are a renowned class of generative models built around the interplay between two components: a generator and a discriminator. The generator turns random noise into lifelike images, while the discriminator tries to distinguish the fake images crafted by the generator from authentic ones. The discriminator is trained with a BCE loss, whereas the generator is trained to maximize the log likelihood that the discriminator assigns to its fabricated samples. For a GAN to generate good images, both the generator and the discriminator must be strong by the end of training. Over the past decade, GANs have produced strikingly realistic images, marking a significant milestone in deep learning. However, they come with two primary challenges: 1) they are notoriously difficult to train, and 2) they are prone to mode collapse, in which the generated images lack diversity because the model reproduces only a narrow slice of the input data distribution rather than its entirety.
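To make the two objectives concrete, here is a minimal sketch of the standard losses (the function names and the use of PyTorch are my own choices, not something prescribed by the original GAN paper); the generator loss below is the common non-saturating form, which maximizes the log likelihood the discriminator assigns to fake samples:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_logits, fake_logits):
    """BCE loss: real samples should be classified as 1, generated ones as 0."""
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def generator_loss(fake_logits):
    """Non-saturating generator loss: push the discriminator to label fakes as real."""
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```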
Diffusion Model
Diffusion models, popularized in 2020, surpass GANs in the quality of their generated results. For an in-depth look at how diffusion models work, please consult my earlier article.
Score Based Generative Models
In generative modeling, the goal is to capture the complex distribution of real data, and using a deep neural network (DNN) is a natural strategy for doing so. The intention is to use the DNN to model a complex probability distribution directly. The catch is that a valid probability density must integrate to one, so the network's output has to be divided by a normalizing constant, and this constant is generally intractable to compute for an arbitrary neural network.
The central concept of the score-based generative modeling papers is the score function. The picture below shows the density function and the score function of a mixture of two Gaussian distributions. The density is drawn with varying shades, with darker shades marking denser regions. The score function is a vector field pointing in the direction in which the density increases fastest. Given the density function, computing the score function is straightforward: it is simply the gradient of the log density. Conversely, knowing the score function allows us to recover the density, essentially by integrating. Hence, the score function and the probability distribution carry the same information.
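To make the density–score relationship concrete, here is a toy sketch (my own code, not from the paper or the post) that evaluates the score of a two-component isotropic Gaussian mixture in closed form; for such a mixture, the score is a responsibility-weighted combination of the individual Gaussians' scores:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Isotropic Gaussian density in d dimensions; x has shape (N, d)."""
    d = x.shape[-1]
    diff = x - mu
    norm = (2 * np.pi * sigma**2) ** (d / 2)
    return np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2)) / norm

def mixture_score(x, mus, weights, sigma):
    """Score (gradient of the log-density) of an isotropic Gaussian mixture."""
    # Responsibilities r_k(x) = w_k N_k(x) / p(x)
    comps = np.stack([w * gaussian_pdf(x, mu, sigma) for w, mu in zip(weights, mus)], axis=0)
    resp = comps / comps.sum(axis=0, keepdims=True)                    # shape (K, N)
    # Per-component scores (mu_k - x) / sigma^2, combined by responsibility
    per_comp = np.stack([(mu - x) / sigma**2 for mu in mus], axis=0)   # shape (K, N, d)
    return np.einsum("kn,knd->nd", resp, per_comp)

# Evaluate the score field at a few points for a two-Gaussian mixture
mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
xs = np.random.randn(5, 2)
print(mixture_score(xs, mus, weights=[0.5, 0.5], sigma=1.0))
```

Evaluating this function on a grid and plotting the resulting arrows reproduces the kind of vector field described above.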
Revisiting the challenge of the normalizing constant: when we take the gradient of the log-probability, the term coming from the normalizing constant vanishes, because the constant does not depend on the input variable. The score function therefore sidesteps the intractable normalizing constant entirely.
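In symbols (this notation is my own shorthand, following the standard presentation): if the DNN outputs an unnormalized log-density $f_\theta(x)$, so that $p_\theta(x) = e^{f_\theta(x)} / Z_\theta$, then

$$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0} = \nabla_x f_\theta(x),$$

so the score of the model can be evaluated without ever computing $Z_\theta$.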
Score Matching
Instead of using a DNN to represent the probability density itself, we can use a DNN to represent the score function. The remaining question is how to train this score model so that it matches the true score of the data distribution; the technique for doing so is called score matching.
In the score matching algorithm, we are given only a set of samples drawn from the input data distribution, never the true score itself, and we train the score model by minimizing the difference between its output and the (unknown) true score. Remarkably, this objective can be rewritten into a form that depends only on the samples, as shown below.
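The standard objective (this particular notation is not from the post, but it is the usual score matching formulation) is the Fisher divergence between the model and the true score:

$$\mathbb{E}_{p(x)}\Big[\big\|\nabla_x \log p(x) - s_\theta(x)\big\|_2^2\Big]$$

Since the true score $\nabla_x \log p(x)$ is unknown, integration by parts turns this into an equivalent objective (up to a constant that does not depend on $\theta$) involving only the model and samples from $p(x)$:

$$\mathbb{E}_{p(x)}\Big[\tfrac{1}{2}\big\|s_\theta(x)\big\|_2^2 + \operatorname{tr}\big(\nabla_x s_\theta(x)\big)\Big]$$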
Langevin Dynamics
After estimating the score function, we still need a way to turn it into a generative model, that is, a way to draw new data points from the vector field of scores. A natural strategy is to move points in the directions suggested by the score function. However, this alone does not yield valid samples: all points eventually collapse onto the modes of the density instead of being distributed according to it. The fix is to follow a noisy version of the score updates, injecting Gaussian noise at every step. This technique is known as Langevin dynamics. More formally, we have the following algorithm.
Our goal is to sample from the data distribution using only its (estimated) score function. Starting from a point drawn from an arbitrary prior, we repeatedly take a small step along the score and add Gaussian noise; as the step size shrinks and the number of steps grows, the resulting points converge to samples from the target distribution.

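A minimal NumPy sketch of this sampler (my own illustrative code; the step size and step count are arbitrary). Each iteration applies the update $x \leftarrow x + \epsilon\, s(x) + \sqrt{2\epsilon}\, z$ with $z$ drawn from a standard Gaussian:

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: step along the score and add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2 * step_size) * noise
    return x

# Example: sample from the two-Gaussian mixture of the earlier sketch by starting
# at random points and following its analytic score.
# samples = langevin_sample(lambda x: mixture_score(x, mus, [0.5, 0.5], 1.0),
#                           x_init=np.random.randn(500, 2))
```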
Directly applying score matching plus Langevin dynamics does not give good results. In regions of low data density there are few samples to learn from, so the estimated score function is inaccurate, and Langevin dynamics struggles to navigate through those low-density regions.

Noise Perturbed Score Matching + Langevin Dynamics
To solve this challenge, we can inject Gaussian noise to perturb our data points. With enough Gaussian noise, the perturbed data points spread throughout the space, shrinking the low-density regions. In the context of image generation, this simply means adding Gaussian noise to every pixel of the image. For the toy example in the picture below, you can see that after injecting the right amount of Gaussian noise, the estimated scores become accurate almost everywhere. But injecting noise does not solve everything: because the data points have been perturbed, the noisy data distributions are no longer good approximations of the original data distribution.

To solve this problem, we can use a sequence of multiple noise levels rather than a single one. Here we use Gaussian noise with mean 0 and standard deviations taken from an increasing sequence: the smallest noise level barely perturbs the data, so its perturbed distribution stays close to the true one, while the largest noise level spreads mass into the low-density regions where score estimation would otherwise fail.
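Written out (notation mine, following the usual presentation), perturbing the data distribution $p(x)$ with Gaussian noise of standard deviation $\sigma$ yields

$$p_\sigma(x) = \int p(y)\, \mathcal{N}(x;\, y,\, \sigma^2 I)\, dy,$$

and with an increasing sequence of levels $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ we ask the score model to estimate $\nabla_x \log p_{\sigma_i}(x)$ for every $i$.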

Noise Conditional Score Networks
To estimate the score function at every noise level, instead of training a separate score model per level, we use a single noise-conditional score model that takes the noise level as an additional input and outputs the estimated score of the data perturbed at that level.

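As a rough illustration of how such a model can be trained, here is a denoising score matching sketch (the interface `score_model(x_noisy, sigma)`, the $\sigma^2$ weighting, and all names are assumptions on my part, following the common recipe rather than the paper's actual code):

```python
import torch

def ncsn_dsm_loss(score_model, x, sigmas):
    """Denoising score matching loss, averaged over randomly drawn noise levels.

    `score_model(x_noisy, sigma)` is assumed to return the estimated score of the
    data perturbed at level `sigma`; `sigmas` is a 1-D tensor of noise levels on
    the same device as `x`.
    """
    # Draw one noise level per example in the batch
    idx = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))
    # Perturb the data: x_noisy = x + sigma * z, so the score of the Gaussian
    # perturbation kernel at x_noisy is -z / sigma
    z = torch.randn_like(x)
    x_noisy = x + sigma * z
    score = score_model(x_noisy, sigma)
    # Weighting each level by sigma^2 gives the usual lambda(sigma) = sigma^2 loss:
    # 0.5 * sigma^2 * ||score + z/sigma||^2 = 0.5 * ||sigma * score + z||^2
    loss = 0.5 * ((sigma * score + z) ** 2).flatten(start_dim=1).sum(dim=1)
    return loss.mean()
```

Sampling then proceeds with annealed Langevin dynamics, running the Langevin update at the largest noise level first and gradually annealing down to the smallest.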
With the new algorithm, Song et al. show in the Generative Modeling by Estimating Gradients of the Data Distribution paper that they can generate realistic and diverse images on the CIFAR-10 dataset. Measured by FID and Inception scores, this algorithm outperforms the best GAN approaches at the time.
Relationship with Other Models
Use Ancestral Sampling from DDPM for SMLD
Ancestral sampling is a technique used in probabilistic graphical models that consists of two steps: 1) sort the nodes of a Bayesian network in a topological order, so that parents come before children; 2) following this order, sample each node from its conditional distribution given the already-sampled values of its parents. The DDPM model uses exactly this ancestral sampling technique. Recall that in DDPM's forward process, each step's output is generated by adding Gaussian noise to that step's input, and in DDPM's backward process, we iteratively remove noise from the sample to produce the final high-quality output. Song et al. show that the same ancestral sampling framework from DDPM can be adopted for the SMLD algorithm.
Let's assume that in SMLD we use a series of Gaussian noise levels with increasing variances. The successive noise scales can then be viewed as a Markov chain in which each state is obtained from the previous one by adding a further amount of Gaussian noise, which mirrors DDPM's forward process and admits an analogous ancestral sampling rule, sketched below.
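Concretely (this is my paraphrase of the derivation rather than a quotation), treating the noise scales as a Markov chain whose forward transition adds just enough Gaussian noise to go from variance $\sigma_{i-1}^2$ to $\sigma_i^2$,

$$p(x_i \mid x_{i-1}) = \mathcal{N}\!\big(x_i;\; x_{i-1},\; (\sigma_i^2 - \sigma_{i-1}^2) I\big),$$

the resulting ancestral sampling rule for SMLD takes roughly the form

$$x_{i-1} = x_i + (\sigma_i^2 - \sigma_{i-1}^2)\, s_\theta(x_i, \sigma_i) + \sqrt{\frac{\sigma_{i-1}^2 (\sigma_i^2 - \sigma_{i-1}^2)}{\sigma_i^2}}\; z_i, \qquad z_i \sim \mathcal{N}(0, I),$$

where $s_\theta$ is the noise-conditional score model.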
Convert SMLD to SDE
If we increase the number of noise perturbation steps to infinity, the noise perturbation process becomes a continuous-time stochastic process. And such stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs).

Generally, an SDE has the following form:

dx = f(x, t) dt + g(t) dw

where f(x, t) is called the drift coefficient, g(t) is the diffusion coefficient, and w denotes a standard Wiener process (Brownian motion).
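To make the continuous-time view a bit more tangible, here is a small Euler–Maruyama sketch (my own illustrative code, not from the paper) that simulates an SDE of this general form by discretizing time; the drift f and diffusion g are passed in as callables, and the step count is arbitrary:

```python
import numpy as np

def euler_maruyama(f, g, x0, t0=0.0, t1=1.0, n_steps=1000, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        dw = rng.standard_normal(x.shape) * np.sqrt(dt)  # Brownian increment
        x = x + f(x, t) * dt + g(t) * dw
        t += dt
    return x
```

Shrinking the step size makes the discretized trajectory approach a true solution of the SDE.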