The Rise of Diffusion Models
Recently, diffusion models have gained significant traction, evidenced by the rapidly expanding user base of platforms like stable diffusion, midjourney, and DALLE-2. In this article, I’ll delve into the intricacies of how diffusion models operate. My aim is to present this content in a manner that minimizes the need for prior expertise in diffusion models or mathematics.
Understanding Diffusion Models with a 1D Illustration
Introduced in 2020 through the paper titled Denoising Probabilistic Diffusion Models(DDPM), the diffusion model has demonstrated superior performance compared to GANs. This model comprises a forward process, which incrementally obliterates information, and a backward process that undoes these changes. To elucidate its workings, we’ll commence with a 1D example. Suppose we have an input $X$ which adheres to a multifaceted distribution. We iteratively add Gaussian noise $u_t$ to $X$ following this equation $X_t = \sqrt{1 - p} X_{t-1} + \sqrt{p} \mu_t$, where $\mu \sim \mathcal{N}(0, 1)$ and $p = 0.01$. Observe that the output at each step is a weighted blend of the preceding step and Gaussian noise. The resulting output after $T$ steps can be expressed as: $$ \begin{aligned} X_T &= \sqrt{1 - p} X_{T-1} + \sqrt{p} \mu_t \\ &= \sqrt{1 - p} (\sqrt{1 - p} X_{T-2} + \sqrt{p} \mu_{T-1}) + \sqrt{p} \mu_t \\ &= \ldots \\ &= (\sqrt{1 - p})^T X_0 + \sum_{i=0}^{T-1} \sqrt{p}(\sqrt{1 - p})^{i} \mu_{T-i} \end{aligned} $$
As the number of steps increases and $T$ approaches infinity, the first term mentioned will converge to zero $$(\sqrt{1 - p})^T \xrightarrow[T \to \infty]{} 0$$. For the second term, since we introduce independent Gaussian noise at every step, the final variance can be derived by summing the variances of all these individual Gaussians. It is given by: $$\sum_{i=0}^{T-1} p(1 - p)^{i} \mu_{T-i} = p \cdot \frac{1 - (1 - p)^T}{1 - (1 - p)} \xrightarrow[T \to \infty]{} 1$$. This implies that given a sufficient number of steps, we can transform any distribution $X$ into a standard normal distribution.
For images sized 512 x 512, the diffusion process can be applied to each RGB channel of every pixel, transforming them into isotropic Gaussian noise. While diffusion can be used to convert intricate distributions into isotropic Gaussian ones, our primary interest lies in its inverse: transitioning the isotropic Gaussian distribution back to the original target distribution, typically where our training data originates. In essence, the diffusion model’s core concept involves training from the diffusion process and then utilizing the model to reverse it.
Decomposing the Markov Diffusion Process in DDPM
The DDPM algorithm, depicted below, models the diffusion process as a Markov sequence where Gaussian noise is progressively added to an image. Given the Markov property, each step relies solely on its predecessor. Thus, the Gaussian transition, also known as the diffusion kernel, is represented as $$q(\textbf{x}_t \vert \textbf{x}_{t-1}) = \mathcal{N}(\textbf{x}_t; \sqrt{1 - \beta_t} \textbf{x}_{t-1}, \beta_t\textbf{I})$$. Analogous to the 1D scenario, we can iteratively express $X_t$ using $X_0$:
$$ \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\epsilon_{t-1} & \text{ ;where } \epsilon_{t-1}, \epsilon_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\epsilon}_{t-2} & \text{ ;where } \bar{\epsilon}_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon \\ \end{aligned} $$
The ability to combine different Gaussians is akin to the 1D scenario, given their independence.
For clarity in the LaTeX expressions, we will represent vectors with regular symbols instead of using bold font.
Leveraging this property, instead of a sequential diffusion process, we can directly compute the perturbed image at any step $X_t$ using: $$q(x_t \vert x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)I) \quad \text{where} \quad \alpha_t = 1 - \beta_t \quad \text{and} \quad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$$. The probability of a given forward trajectory is denoted by $$q(x_{0:T}) = q(x_0)\prod_{i=0}^{T} q(x_t \vert x_{t-1})$$.
Loss Upper Bound for DDPM Models
Given small timesteps, the reverse process can be approximated as a Markov chain with Gaussian transition probability. We use a neural network to estimate the reverse process $q(x_{t-1}\vert x_t)$. We represent the model’s predicted transition probability as $$p_\theta(x_{t-1}\vert x_t)=\mathcal{N}(x_{t-1};\mu_\theta(x_t, t), \Sigma_\theta (x_t, t))$$. Our goal is to have the data sampled from Gaussian noise after this reverse process align closely with the target distribution, which means minimizing the KL divergence between $q(x_0)$ and $p_\theta(x_0)$.
$$
\begin{aligned}
\mu_\theta^*, \Sigma_\theta^* &= \arg\min_{\mu_\theta, \Sigma_\theta} \left( D_{KL}(q(x_0) \vert | p_\theta(x_0)) \right) \\
&=\arg\min_{\mu_\theta, \Sigma_\theta}\left( -\int q(x_0) log\left( \frac{p_\theta(x_0)}{q(x_0)} \right) dx_0 \right) \\
&=\arg\min_{\mu_\theta, \Sigma_\theta}\left( -\int q(x_0) log\left( p_\theta(x_0) \right) dx_0 \right)
\end{aligned}
$$
We can utilize the right-hand side as our training loss function. This is also the loss when maximizing the log likelihood with the estimated $p_\theta(x_0)$. Obtaining $p_\theta(x_0)$ involves starting from $x_T$ and integrating over all potential backward paths, so $$p_\theta(x_0) = \int p_\theta(x_{0:T})dx_{1:T}$$. By inserting the posterior distribution of the forward transition kernel, we find $$log(p_\theta(x_0)) = log\left(\int \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} q(x_{1:T}\vert x_0)dx_{1:T} \right)$$.
Using the Jensen inequality, we obtain:
$$
log\left(\int \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} q(x_{1:T}\vert x_0)dx_{1:T} \right) \ge \int log\left( \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} \right) q(x_{1:T}\vert x_0)dx_{1:T}
$$
This leads us to define an upper bound $\bar{L}$ for the loss $L$:
$$
\begin{aligned}
L &= -\int q(x_0)log\left(\int \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} q(x_{1:T}\vert x_0)dx_{1:T} \right) dx_0 \\
&\le -\int q(x_0) \left( \int log \left(\frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)}\right) q(x_{1:T}\vert x_0)dx_{1:T} \right) dx_0\\
&\le -\int log\left( \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} \right) q(x_{0:T})dx_{0:T} \\
&=\mathbb{E}_{q(\mathbf{x}_{0:T})}\left(-log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} \right) \\
& = \bar{L}
\end{aligned}
$$
Rather than optimizing $L$, we focus on optimizing $\bar{L}$:
$$
\small{
\begin{aligned}
\bar{L} &= \mathbb{E}_{q(\mathbf{x}_{0:T})}-\log\left( \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)} \right) \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left ( -\log\frac{p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \vert x_t)}{q(x_1 \vert x_0) \prod_{t=2}^T q(x_t \vert x_{t-1}, x_0)} \right ) \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left ( -\log\frac{p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \vert x_t)}{q(x_1 \vert x_0) \prod_{t=2}^T \left( \frac{q(x_{t-1} \vert x_t, x_0)q(x_t \vert x_0)}{q(x_{t-1} \vert x_0)} \right)} \right ) \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left ( -\log\frac{p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \vert x_t)}{q(x_T \vert x_0) \prod_{t=2}^T q(x_{t-1\vert x_t, x_0})} \right ) \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left ( -\log\frac{p(x_T)}{q(x_T\vert x_0)} - \sum_{t=2}^T \log \frac{p_\theta(x_{t-1}\vert x_t)}{q(x_{t-1}\vert x_t, x_0)} - \log p_\theta(x_0 \vert x_1) \right) \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left ( D_{KL}(q(x_T\vert x_0)|p(x_T)) + \sum_{t=2}^TD_{KL}(q(x_{t-1}\vert x_t, x_0)| p_\theta(x_{t-1}\vert x_t)) - \log p_\theta(x_0\vert x_1) \right)
\end{aligned}}
$$
$\bar{L}$ consists of three terms. The first term, independent of the neural network parameters $\theta$, can be minimized by selecting sufficiently large steps, ensuring the final denoised data distribution resembles an isotropic gaussian distribution. While the last term is typically negligible, it can be represented using a distinct neural network if desired. The second term, the most significant, can be articulated as a sum of KL divergences.
The posterior distribution of the diffusion kernel, given $x_0$, is defined analytically as:
$$
\scriptsize{
\begin{aligned}
q(x_{t-1}\vert x_t, x_0) &= q(x_t\vert x_{t-1}, x_0) \frac{q(x_{t-1}\vert x_0)}{q(x_t \vert x_0)} \\
&\propto \exp\left( -\frac{1}{2}\left( \frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{\beta_t} + \frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}x_0)^2}{1-\bar{\alpha}_{t-1}} + \frac{(x_t-\sqrt{\bar{\alpha}_t}x_0)^2}{1-\bar{\alpha}_t} \right ) \right) \\
&= \exp\left( -\frac{1}{2}\left( \frac{x_t^2-2\sqrt{\alpha_t}x_t {\color{Green} x_{t-1}} + \alpha_t{\color{Red} x_{t-1}^2}}{\beta_t} + \frac{{\color{Red} x_{t-1}^2}-2\sqrt{\bar{\alpha}_{t-1}}x_0{\color{Green} x_{t-1}} + \bar{\alpha}_{t-1}x_0^2}{1-\bar{\alpha}_{t-1}} + \frac{(x_t-\sqrt{\bar{\alpha}_t}x_0)^2}{1-\bar{\alpha}_t} \right ) \right) \\
&= \exp\left( -\frac{1}{2}\left({\color{Red} (\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}})x_{t-1}^2} - {\color{Green} (\frac{2\sqrt{\alpha_t}}{\beta_t}x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0)x_{t-1}} + C(x_t, x_0) \right)\right)
\end{aligned}}
$$
Consider the following definitions: $$ \hat{\beta_t} = \frac{1}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}})} =\frac{1}{\frac{\alpha_t-\bar{\alpha}_t+\beta_t}{\beta_t(1-\bar{\alpha}_{t-1})}} =\frac{1 - \bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t $$ $$ \begin{aligned} \hat{\mu}_t(x_t, x_0) &= \frac{\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}})}\\ &=(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0)\frac{1 - \bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t\\ &=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}))}{1-\bar{\alpha_t}}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha_t}}x_0 \end{aligned} $$ Given these, the distribution is represented as $q(x_{t-1}\vert x_t, x_0) = \mathcal{N}(x_{t-1}; {\color{Green} \hat{\mu}(x_t, x_0)}, {\color{Red} \hat{\beta_t}I})$.
With the relation $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1 - \bar{\alpha}_t}\epsilon_t)$, the mean transforms as: $$ \begin{aligned} \hat{\mu}_t(x_t, x_0) &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\epsilon_t) \\ &= \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_t \right) \end{aligned} $$ This illustrates that the mean can be expressed in terms of the introduced error.
Consider the crucial term in our loss upper bound, $$\mathbb{E}_{q(\mathbf{x}_{0:T})} \left (\sum_{t=2}^TD_{KL}(q(x_{t-1}\vert x_t, x_0)| p_\theta(x_{t-1}\vert x_t)) \right)$$. Given that both $q$ and $p_\theta$ are Gaussian, their KL divergence can be analytically determined.
KL Divergence Between Two Gaussian Distributions
Consider two Gaussian distributions, $p$ and $q$, having the same covariance matrices:
$$ \begin{align*} \log q(x) &= -\frac{1}{2} \left( (x - \hat{\mu})^T \Sigma^{-1} (x - \hat{\mu}) \right) \\ \log p(x) &= -\frac{1}{2} \left( (x - \mu_\theta)^T \Sigma^{-1} (x - \mu_\theta) \right) \\ \log q(x) - \log p(x) &= -\frac{1}{2} \left( (x - \hat{\mu})^T \Sigma^{-1} (x - \hat{\mu}) - (x - \mu_\theta)^T \Sigma^{-1} (x - \mu_\theta) \right) \\ &= -\frac{1}{2} \left( x^T \Sigma^{-1} x - 2x^T \Sigma^{-1} \hat{\mu} + \hat{\mu}^T \Sigma^{-1} \hat{\mu} \right) \\ &+ \frac{1}{2} \left( x^T \Sigma^{-1} x - 2x^T \Sigma^{-1} \mu_\theta + \mu_\theta^T \Sigma^{-1} \mu_\theta \right) \\ &= x^T \Sigma^{-1} (\hat{\mu} - \mu_\theta) - \frac{1}{2} \left( \hat{\mu}^T \Sigma^{-1} \hat{\mu} - \mu_\theta^T \Sigma^{-1} \mu_\theta \right) \end{align*} $$
Consider that the expectation of ${x_T}$ under $q(x)$ is given by ${\hat{\mu}^T}$, and $\Sigma = \hat{\beta}_tI$. Using these, we can plug into the KL divergence:
$$ \begin{align*} D_{KL}(q||p) &= \int q(x) \log \left( \frac{q(x)}{p(x)} \right) dx \\ &= \int q(x) \left( \log q(x) - \log p(x) \right) dx \\ &= \int q(x) \left( x^T \Sigma^{-1} (\hat{\mu} - \mu_\theta ) - \frac{1}{2} \left( \hat{\mu}^T \Sigma^{-1} \hat{\mu} - \mu_\theta^T \Sigma^{-1} \mu_\theta \right) \right) dx \\ &= \hat{\mu}^T \Sigma^{-1}(\hat{\mu} - \mu_\theta) -\frac{1}{2} \left(\hat{\mu}^T \Sigma^{-1} \hat{\mu} - \mu_\theta^T \Sigma^{-1} \mu_\theta \right) \\ &= \frac{1}{2\hat{\beta}_t}\left( \hat{\mu}_T\hat{\mu} + \mu_\theta^T\mu_\theta - 2 \hat{\mu}^T\mu_\theta \right) \\ &= \frac{1}{2\hat{\beta}_t} \left( \| \hat{\mu} - \mu_\theta \|^2 \right) \end{align*} $$
Simplified Loss Equation for DDPM Models
We aim to minimize the L2 distance between the means of two Gaussian distributions. Initially, our goal is for $\mu_\theta(x_t, t)$ to approximate $\hat{\mu}(x_t, x_0)$, given by $ \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_t \right) $. Instead of directly predicting $ \hat{\mu}(x_t, x_0) $, the model can be reformulated to estimate $ \epsilon_t $. Thus, $ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) $. Through this reparameterization, our model focuses on predicting errors.
The loss could be simplied to $$ \begin{aligned} L &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left (\sum_{t=2}^TD_{KL}(q(x_{t-1}\vert x_t, x_0)| p_\theta(x_{t-1}\vert x_t)) \right) \\ &=\mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[ \frac{1}{2\hat{\beta}_t} \| \hat{\mu}_t(x_t, x_0) - \mu_{\theta}(x_t, t) \|^2 \Big] \\ &=\mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[ \frac{1}{2\hat{\beta}_t} \| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_t \right) - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) \|^2 \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\frac{(1-\alpha_t)^2}{2\hat{\beta}_t\alpha_t(1-\bar{\alpha}_t)} \| \epsilon_t - \epsilon_{\theta}(x_t, t) \|^2 \Big] \\ &= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[ \frac{(1-\alpha_t)^2}{2\hat{\beta}_t\alpha_t(1-\bar{\alpha}_t)} \| \epsilon_t - \epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon_t, t) \|^2 \Big] \end{aligned} $$. The original DDPM paper demonstrates improved empirical outcomes by omitting the weights in the equation, leading to the final loss $$ \begin{aligned} L^{\text{simple}} = \mathbb{E}_{t \sim [1, T]}\Big[ \| \epsilon_t - \epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon_t, t) \|^2 \Big] \end{aligned} $$.
Training & Sampling Algorithm
Using the simplified loss, the training process is straightforward. The neural network modeling the injected noise is versatile and can handle diverse data types. For images, a U-Net architecture is commonly employed. Training involves iterative sampling of triplets: <input image, step, noise> followed by gradient descent based on the loss function.