Diffusion models are becoming popular. This note presents a step-by-step derivation of diffusion models. I hope it helps you understand the technique without struggling through the equations :)
This post will be updated to cover some state-of-the-art diffusion models. I'd appreciate any suggestions and bug reports: jishang [at] cs &dot& stony brook %dot% edu.
$$ p_\theta(x_{t-1}|x_t), \quad t=T, \dots, 1. $$ Here \(\theta\) means that we use a model (a neural network) to estimate this probability distribution, since the true distribution is unknown.
\(\mathcal{L}_t\) involves two distributions. We first focus on \(q\left(x_{t-1}|x_t, x_0\right)\), which we can derive because the forward process is known. \(p_\theta\left( x_{t-1} | x_t \right)\) comes from our model output and will then be compared to \(q\left(x_{t-1}|x_t, x_0\right)\). By Bayes' rule and the Markov property of the forward process (given \(x_{t-1}\), \(x_t\) does not depend on \(x_0\)): $$ \begin{align} q(x_{t-1}|x_t, x_0) &= \frac{q(x_t|x_{t-1}, x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &= \frac{q(x_t|x_{t-1}, \enclose{updiagonalstrike}{x_0})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \end{align} $$ We already know \(q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)\mathbf{I})\) by definition, so we now need to derive \(q(x_{t-1}|x_0)\) and \(q(x_t|x_0)\), which share the same form and differ only in the time index. Unrolling the forward process from \(x_T\) (the same steps hold for any \(t\)): $$ \begin{align} \alpha_t &:= 1 - \beta_t \quad \text{and} \quad \bar{\alpha}_t := \prod_{i=1}^{t}\alpha_i, \quad {\color{gray}\text{auxiliary notations}} \\ x_T &= \sqrt{1-\beta_T} x_{T-1} + \sqrt{\beta_T}\epsilon_T \\ &= \sqrt{\alpha_T} x_{T-1} + \sqrt{1-\alpha_T} \underbrace{\epsilon_{T}}_{\mathcal{N}(0, \mathbf{I})} \\ &= \sqrt{\alpha_T}\sqrt{\alpha_{T-1}}x_{T-2} + \sqrt{\alpha_T}\sqrt{1-\alpha_{T-1}}\underbrace{\epsilon^*_{T-1}}_{\mathcal{N}(0, \mathbf{I})} + \sqrt{1-\alpha_T}\epsilon_{T} \\ &= \sqrt{\alpha_T \alpha_{T-1}}x_{T-2} + \underbrace{( \sqrt{\alpha_T\left(1-\alpha_{T-1}\right)}\epsilon^*_{T-1} + \sqrt{1-\alpha_T}\epsilon_{T}) }_{\text{merge two Gaussians w/ reparameterization: variances add}} \\ &= \sqrt{\alpha_T \alpha_{T-1}}x_{T-2} + \sqrt{\alpha_T\left(1-\alpha_{T-1}\right)+1-\alpha_T}\underbrace{\epsilon_{T-1}}_{\mathcal{N}(0, \mathbf{I})} \\ &= \sqrt{\alpha_T \alpha_{T-1}}x_{T-2} + \sqrt{1-\alpha_T\alpha_{T-1}}\epsilon_{T-1} \\ &= \cdots \\ &= \sqrt{\alpha_T\alpha_{T-1}\dots\alpha_1}x_0 + \sqrt{1-\alpha_T\alpha_{T-1}\dots\alpha_1}\epsilon_1 \\ &= \sqrt{\bar{\alpha}_T} x_0 + \sqrt{1-\bar{\alpha}_T} \epsilon_1\\ q(x_t|x_0) &= \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I}) \end{align} $$
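The closed form \(q(x_t|x_0)\) can be checked numerically. The snippet below is a minimal sketch, not a reference implementation: the linear \(\beta\) schedule, its endpoints, and the helper names (`make_schedule`, `forward_iterative`, `forward_closed_form`) are assumptions made for illustration. It applies the one-step forward process \(t\) times and compares the result with a single draw from \(\mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I})\).

```python
import numpy as np


def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; index i holds the value for step t = i + 1."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i
    return betas, alphas, alpha_bars


def forward_iterative(x0, alphas, t, rng):
    """Apply q(x_s | x_{s-1}) for s = 1..t, one step at a time."""
    x = np.array(x0, dtype=float)
    for i in range(t):
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(alphas[i]) * x + np.sqrt(1.0 - alphas[i]) * eps
    return x


def forward_closed_form(x0, alpha_bars, t, rng):
    """Sample x_t directly from q(x_t | x_0); also return the noise used."""
    eps = rng.standard_normal(np.shape(x0))
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps


if __name__ == "__main__":
    _, alphas, alpha_bars = make_schedule()
    rng = np.random.default_rng(0)
    x0 = np.full(100_000, 2.0)           # many copies of the same scalar x_0
    t = 50
    x_iter = forward_iterative(x0, alphas, t, rng)
    x_closed, _ = forward_closed_form(x0, alpha_bars, t, rng)
    # Both sample means should be close to sqrt(alpha_bar_t) * x_0,
    # and both sample variances close to 1 - alpha_bar_t.
    print(x_iter.mean(), x_closed.mean(), 2.0 * np.sqrt(alpha_bars[t - 1]))
    print(x_iter.var(), x_closed.var(), 1.0 - alpha_bars[t - 1])
```

The two samplers use different random draws, so they agree only in distribution (mean and variance), not sample by sample.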
Now return to \(\frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)}\). Only the terms that involve \(x_{t-1}\) matter; everything else (in red) is collected into \(C(x_t, x_0)\) and absorbed into the normalization: $$ \begin{align} q(x_{t-1}|x_t, x_0) &= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &= \frac{\mathcal{N}(x_t; \sqrt{\alpha_t}x_{t-1}, \left(1-\alpha_t\right)\mathbf{I}) ~ \mathcal{N}(x_{t-1};\sqrt{\bar{\alpha}_{t-1}}x_0, \left(1-\bar{\alpha}_{t-1}\right)\mathbf{I}) } {\mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I})} \\ &\propto \exp \left( -\left[ \frac{(x_t-\sqrt{\alpha_t}x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}x_0)^2}{2(1-\bar{\alpha}_{t-1})} - \frac{(x_t-\sqrt{\bar{\alpha}_t}x_0)^2}{2(1-\bar{\alpha}_t)} \right] \right) \\ &= \exp\left( -\frac{1}{2} \left[ \frac{{\color{red}x^2_t} + \alpha_t x^2_{t-1} - 2\sqrt{\alpha_t}x_{t-1}x_{t}}{1-\alpha_t} + \frac{x^2_{t-1}+{\color{red}\bar{\alpha}_{t-1}x^2_0}-2\sqrt{\bar{\alpha}_{t-1}}x_0x_{t-1}}{1-\bar{\alpha}_{t-1}} - {\color{red} \frac{(x_t-\sqrt{\bar{\alpha}_t}x_0)^2}{1-\bar{\alpha}_t} }\right] \right) \\ &= \exp\left( -\frac{1}{2} \left[ \frac{\alpha_t x^2_{t-1} - 2\sqrt{\alpha_t}x_{t-1}x_{t}}{1-\alpha_t} + \frac{x^2_{t-1}-2\sqrt{\bar{\alpha}_{t-1}}x_0x_{t-1}}{1-\bar{\alpha}_{t-1}} + {\color{red} C(x_t, x_0) }\right] \right) \\ &\propto \exp\left( -\frac{1}{2} \left[ \left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}} \right)x^2_{t-1} + \left( \frac{-2\sqrt{\alpha_t}x_t}{1-\alpha_t} + \frac{-2\sqrt{\bar{\alpha}_{t-1}}x_0}{1-\bar{\alpha}_{t-1}} \right) x_{t-1} \right] \right) \\ &= \exp \left( -\frac{1}{2(1-\alpha_t)(1-\bar{\alpha}_{t-1})} \left[ \left( \alpha_t(1-\bar{\alpha}_{t-1})+1-\alpha_t \right) x^2_{t-1} -2 \left( (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}x_0 \right)x_{t-1} \right] \right) \\ &= \exp \left( -\frac{1}{2(1-\alpha_t)(1-\bar{\alpha}_{t-1})} \left[ \left( 1-\bar{\alpha}_{t} \right) x^2_{t-1} -2 \left( (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}x_0 \right)x_{t-1} \right] \right) \\ &= \exp \left( 
-\frac{1-\bar{\alpha}_{t}}{2(1-\alpha_t)(1-\bar{\alpha}_{t-1})} \left[ x^2_{t-1} -2 \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}x_0} {1-\bar{\alpha}_{t}} x_{t-1} \right] \right) \\ &\propto \mathcal{N} \left(x_{t-1}; \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}{\color{red}x_0}} {1-\bar{\alpha}_{t}}, \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}\mathbf{I}\right) \\ \mu_q(x_t, x_0) &:= \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}{\color{red}x_0}} {1-\bar{\alpha}_{t}} \end{align} $$ Thus, \(q(x_{t-1}|x_t, x_0)\) is a Gaussian distribution (the last step completes the square in \(x_{t-1}\)). However, \(x_0\) is unknown when we run the reverse process, so we rewrite \(x_0\) in terms of \(x_t\) and the noise \(\epsilon_1\): $$ \begin{align} x_t &= \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_1 \\ x_0 &= \frac{1}{\sqrt{\bar{\alpha}_t}}\left( x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_1 \right) \\ \mu_q(x_t, x_0) &= \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\sqrt{\bar{\alpha}_{t-1}}{\color{red}x_0}} {1-\bar{\alpha}_{t}} \\ &= \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t){\color{blue}\sqrt{\bar{\alpha}_{t-1}}}\frac{1}{\color{blue}\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_1) } {1-\bar{\alpha}_{t}} \\ &= \frac{ (1-\bar{\alpha}_{t-1})\sqrt{\alpha_t}x_t + (1-\alpha_t)\frac{1}{\color{blue}\sqrt{\alpha_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_1) } {1-\bar{\alpha}_{t}} \\ &= \frac{ (1-\bar{\alpha}_{t-1})\alpha_t x_t + (1-\alpha_t)(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_1) } {(1-\bar{\alpha}_{t})\sqrt{\alpha_t}} \\ &= \frac{[\alpha_t(1-\bar{\alpha}_{t-1})+1-\alpha_t]x_t - (1-\alpha_t)\sqrt{1-\bar{\alpha}_t}\epsilon_1} {(1-\bar{\alpha}_{t})\sqrt{\alpha_t}} \\ &= \frac{(1-\bar{\alpha}_{t})x_t - (1-\alpha_t)\sqrt{1- \bar{\alpha}_t}\epsilon_1} {(1-\bar{\alpha}_{t})\sqrt{\alpha_t}} \\ &= \frac{1}{\sqrt{\alpha_t}}x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\epsilon_1 \quad \quad \text{then reparameterization}\\ &=: \mu_q(x_t, t) + \epsilon(t), \quad \text{with} \quad \mu_q(x_t, t) := \frac{1}{\sqrt{\alpha_t}}x_t, \quad \epsilon(t) := -\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\epsilon_1 \end{align} $$
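As a sanity check on the algebra above, the two expressions for the posterior mean (one in terms of \(x_0\), one in terms of \(\epsilon_1\)) can be compared numerically. This is a minimal sketch under stated assumptions: the linear \(\beta\) schedule, the function names, and the convention \(\bar{\alpha}_0 := 1\) (an empty product, needed only for \(t=1\)) are illustrative choices, not part of the derivation.

```python
import numpy as np


def _coeffs(t, alphas, alpha_bars):
    """Return (alpha_t, alpha_bar_t, alpha_bar_{t-1}) for a 1-indexed step t."""
    a_t = alphas[t - 1]
    ab_t = alpha_bars[t - 1]
    ab_prev = alpha_bars[t - 2] if t > 1 else 1.0  # assumed convention: alpha_bar_0 = 1
    return a_t, ab_t, ab_prev


def posterior_mean_from_x0(x_t, x0, t, alphas, alpha_bars):
    """mu_q(x_t, x_0): the mean of q(x_{t-1} | x_t, x_0)."""
    a_t, ab_t, ab_prev = _coeffs(t, alphas, alpha_bars)
    num = (1 - ab_prev) * np.sqrt(a_t) * x_t + (1 - a_t) * np.sqrt(ab_prev) * x0
    return num / (1 - ab_t)


def posterior_mean_from_eps(x_t, eps, t, alphas, alpha_bars):
    """The same mean after substituting x_0 = (x_t - sqrt(1 - ab_t) * eps) / sqrt(ab_t)."""
    a_t, ab_t, _ = _coeffs(t, alphas, alpha_bars)
    return (x_t - (1 - a_t) / np.sqrt(1 - ab_t) * eps) / np.sqrt(a_t)


def posterior_variance(t, alphas, alpha_bars):
    """Scalar variance of q(x_{t-1} | x_t, x_0); multiply by I for the covariance."""
    a_t, ab_t, ab_prev = _coeffs(t, alphas, alpha_bars)
    return (1 - a_t) * (1 - ab_prev) / (1 - ab_t)


if __name__ == "__main__":
    alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
    alpha_bars = np.cumprod(alphas)
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(4)
    eps = rng.standard_normal(4)
    t = 500
    x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1 - alpha_bars[t - 1]) * eps
    print(posterior_mean_from_x0(x_t, x0, t, alphas, alpha_bars))
    print(posterior_mean_from_eps(x_t, eps, t, alphas, alpha_bars))
```

The two printed vectors should match up to floating-point error, provided `x_t` was built from the same `x0` and `eps`.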
Since \(q(x_{t-1}|x_t, x_0)\) is a Gaussian distribution, we can assume that \( p_\theta\left( x_{t-1} | x_t \right) \) is also a Gaussian distribution, which lets us evaluate the KL-divergence between them in closed form. Instead of directly measuring the loss on the mean, we again use a reparameterization so that the model predicts the noise and we only measure the squared error between \(\epsilon_1\) and \(\epsilon_\theta\): $$ \begin{align} p_\theta\left( x_{t-1} | x_t \right) &= \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \\ &= \mathcal{N}\left(x_{t-1}; \frac{1}{\sqrt{\alpha_t}}x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\epsilon_\theta(x_t, t), \Sigma_\theta(x_t, t)\right) \\ \mathcal{L}_{t} &= \sum_{t>1}^T D_{\text{KL}}\left({q\left(x_{t-1}|x_t, x_0\right)} \,\|\, {p_\theta\left( x_{t-1} | x_t \right)}\right) \\ & \propto \mathbb{E}_{t} \left[ \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t}\| \epsilon_1 - \epsilon_\theta(x_t, t) \|^2_2 \right] \end{align} $$ The weight comes from the difference of the two means, which equals \(\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\left(\epsilon_\theta(x_t, t) - \epsilon_1\right)\). In DDPM, the variances are not learned but fixed to constants (taken as \(\mathbf{I}\) here), the weighting factor is dropped, and the Mean Squared Error (MSE) between \(\epsilon_1\) and \(\epsilon_\theta(x_t, t)\) is used as the final loss.
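The loss above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the DDPM reference code: `eps_model` is a hypothetical callable standing in for the neural network \(\epsilon_\theta(x_t, t)\), the function name `ddpm_noise_loss` and the `weighted` flag are made up for this example, and the linear schedule in the demo is an assumption. The per-sample error is averaged over dimensions, so it equals \(\|\epsilon_1 - \epsilon_\theta\|_2^2\) divided by the data dimension.

```python
import numpy as np


def ddpm_noise_loss(eps_model, x0, alphas, alpha_bars, rng, weighted=False):
    """Monte Carlo estimate of the noise-prediction loss for one batch.

    eps_model: hypothetical callable (x_t, t) -> predicted noise, same shape as x_t.
    x0:        clean data, shape (batch, ...).
    """
    batch = x0.shape[0]
    T = len(alphas)
    t = rng.integers(1, T + 1, size=batch)           # one random step per sample, 1..T
    eps = rng.standard_normal(x0.shape)              # the noise the model must predict
    # Reshape alpha_bar_t so it broadcasts over the non-batch dimensions.
    ab = alpha_bars[t - 1].reshape((batch,) + (1,) * (x0.ndim - 1))
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps  # closed-form q(x_t | x_0)
    eps_hat = eps_model(x_t, t)
    per_sample = ((eps - eps_hat) ** 2).reshape(batch, -1).mean(axis=1)
    if weighted:
        # Weight from the KL term: (1 - alpha_t)^2 / ((1 - alpha_bar_t) * alpha_t)
        a_t = alphas[t - 1]
        per_sample = (1.0 - a_t) ** 2 / ((1.0 - alpha_bars[t - 1]) * a_t) * per_sample
    return float(per_sample.mean())


if __name__ == "__main__":
    alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
    alpha_bars = np.cumprod(alphas)
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((64, 8))
    # A stand-in "model" that always predicts zero noise.
    print(ddpm_noise_loss(lambda x_t, t: np.zeros_like(x_t), x0, alphas, alpha_bars, rng))
```

With `weighted=False` this corresponds to the plain MSE objective; `weighted=True` keeps the factor from the KL term. In practice `eps_model` would be a trained network and the loss would be differentiated with an autodiff framework, which this sketch does not attempt.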
For the connection to other probabilistic models, recent literature, and hands-on experience, I'd recommend either the original papers or the awesome posts above.
Last modified 2023/02.