Importance Sampling

14 September 2017

Consider a probability distribution on \(\mathcal X\) with the density \(p\). (We will interchangeably denote the probability distribution and its density by the same letter). Given a function \(f: \mathcal X \to \mathbb R\), our goal is to estimate \begin{align} I := \E[f(X)] := \int_{\mathcal X} f(x) p(x) \,\mathrm dx. \end{align}

Monte Carlo Estimator

We can estimate this using \(N\) independently, identically distributed (iid) distributed samples from \(p\) as follows: \begin{align} X_n &\sim p \\
I_N^{\text{MC}} &= \frac{1}{N} \sum_{n = 1}^N f(X_n). \end{align}

The expectation of \(I_N^{\text{MC}}\) is equal to \(I\): \begin{align} \E\left[I_N^{\text{MC}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N I \\
&= I. \end{align}

The variance \(I_N^{\text{MC}}\) is: \begin{align} \Var\left[I_N^{\text{MC}}\right] &= \E\left[{I_N^{\text{MC}}}^2\right] - \E\left[I_N^{\text{MC}}\right]^2 \\
&= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N f(X_n)\right)^2\right] - I^2 \\
&= \frac{1}{N^2} \E\left[\sum_{n = 1}^N f(X_n)^2 + \sum_{n, \ell, n \neq \ell} f(X_n) f(X_\ell)\right] - I^2 \\
&= \frac{1}{N^2} \left(\sum_{n = 1}^N \E\left[f(X_n)^2\right] + \sum_{n, \ell, n \neq \ell} \E\left[f(X_n) f(X_\ell)\right]\right) - I^2 \\
&= \frac{1}{N^2} \left(N \E\left[f(X)^2\right] + (N^2 - N) \E\left[f(X)\right]^2 \right) - I^2 \\
&= \frac{\E\left[f(X)^2\right] - \E\left[f(X)\right]^2}{N} \\
&= \frac{\Var[f(X)]}{N}. \end{align}

Importance Sampling

Instead of drawing samples from \(p\), assume that we can

draw samples from a proposal distribution on \(\mathcal X\) with density \(q\),
evaluate density \(q\), and
evaluate density \(p\) (we will consider the case when we can only evaluate \(p\) up to a normalization constant later).

The importance sampling estimator is as follows: \begin{align} X_n &\sim q \\
W_n &= \frac{p(X_n)}{q(X_n)} \\
I_N^{\text{IS}} &= \frac{1}{N} \sum_{n = 1}^N W_n f(X_n). \label{eq:is_est} \end{align}

The expectation of \(I_N^{\text{IS}}\) is equal to \(I\): \begin{align} \E\left[I_N^{\text{IS}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[W_n f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[\frac{p(X_n)}{q(X_n)} f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} \frac{p(x_n)}{q(x_n)} f(x_n) q(x_n) \,\mathrm dx_n \\
&= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} p(x) f(x) \,\mathrm dx \\
&= I. \end{align}

The variance of \(I_N^{\text{IS}}\) is: \begin{align} \Var\left[I_N^{\text{IS}}\right] &= \E\left[{I_N^{\text{IS}}}^2\right] - I^2 \\
&= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right)^2\right] - I^2 \\
&= \frac{1}{N^2} \E\left[\sum_{n = 1}^N \left(W_n f(X_n)\right)^2 + \sum_{n, \ell, n \neq \ell} \left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\
&= \frac{1}{N^2} \sum_{n = 1}^N \E\left[\left(W_n f(X_n)\right)^2\right] + \frac{1}{N^2} \sum_{n, \ell, n \neq \ell} \E\left[\left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\
&= \frac{1}{N} \left(\int_{\mathcal X} \left(\frac{p(x)}{q(x)} f(x)\right)^2 q(x) \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{q(x)} \,\mathrm dx - I^2 \right) \end{align}

The “optimal” proposal \(q^{\text{opt}}\) has the density \begin{align} q^{\text{opt}}(x) &= \frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x’)| p(x’) \,\mathrm dx’}. \end{align} This proposal is optimal in the sense that estimator’s variance is lowest possible. The variance of an estimator using an optimal proposal, \(\Var\left[I_N^{\text{IS, opt}}\right]\) is not greater than the variance of an estimator using an arbitrary proposal \(q\), \(\Var\left[I_N^{\text{IS}}\right]\): \begin{align} \Var\left[I_N^{\text{IS, opt}}\right] &= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{\frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x’)| p(x’) \,\mathrm dx’}} \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} \frac{p(x)^2 f(x)^2}{|f(x)| p(x)} \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} |f(x)| p(x) \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\left(\int_{\mathcal X} \frac{|f(x)| p(x)}{q(x)} \cdot 1 \cdot q(x) \,\mathrm dx\right)^2 - I^2 \right) \\
&\leq \frac{1}{N} \left(\left(\int_{\mathcal X} \left(\frac{|f(x)| p(x)}{q(x)}\right)^2 \cdot q(x) \,\mathrm dx\right) \cdot \left(\int_{\mathcal X} 1^2 \cdot q(x) \,\mathrm dx\right) - I^2 \right) & \text{(Cauchy-Schwarz ineq.)}\\
&= \frac{1}{N} \left(\int_{\mathcal X} \frac{f(x)^2 p(x)^2}{q(x)} \,\mathrm dx - I^2 \right) \\
&= \Var\left[I_N^{\text{IS}}\right]. \end{align}

Self-Normalized Importance Sampling

Now, assume that we can

draw samples from a proposal distribution on \(\mathcal X\) with density \(q\),
evaluate density \(q\), and
evaluate the density \(p\) only up to a normalization constant, i.e. we can evaluate \(\tilde p(x) = Zp(x)\) where \(Z\) is the normalization constant.

The importance sampling estimator is as follows: \begin{align} X_n &\sim q \\
\tilde W_n &= \frac{\tilde p(X_n)}{q(X_n)} \\
\tilde I_N^{\text{IS}} &= \frac{\frac{1}{N}\sum_{n = 1}^N \tilde W_n f(X_n)}{\frac{1}{N}\sum_{n = 1}^N \tilde W_n}. \label{eq:self_norm_est} \end{align}

Consistency of The Estimator

Theorem. \(\tilde I_N^{\text{IS}}\) converges to \(I\) almost surely, i.e. \begin{align} P\left(\left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}\right) = 1. \end{align}

Proof. Denote the numerator of \eqref{eq:self_norm_est} by \(\alpha_N\) and the denominator by \(\beta_N\). By the law of the strong law of large numbers, \(\alpha_N\) converges to \(ZI\) and \(\beta_N\) converges to \(Z\). Now, let \begin{align} A = \left\{\omega: \lim_{N \to \infty} \alpha_N(\omega) = ZI\right\} \\
B = \left\{\omega: \lim_{N \to \infty} \beta_N(\omega) = Z\right\} \\
C = \left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}. \end{align} We have \(P(A) = P(B) = 1\) and hence \(P(A \cap B) = P(A) + P(B) - P(A \cup B) = 1\) (see properties probability measures). We also have \(A \cap B \subseteq C\) since for any \(\omega \in A \cap B\), \(\lim_{N \to \infty} \alpha_N(\omega) = ZI\) and \(\lim_{N \to \infty} \beta_N(\omega) = Z\) and given the properties of a limit, \(\lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = \lim_{N \to \infty} \frac{\alpha_N(\omega)}{\beta_N(\omega)} = \frac{ZI}{Z} = I\). Hence \(P(C) \geq P(A \cup B) = 1\). Since \(P(C) \leq 1\), \(P(C) = 1\). \(\square\)

Asymptotic Variance

It is more difficult to compute the bias and variance of the self-normalized importance sampling estimator \eqref{eq:self_norm_est} because it cannot be decomposed to a sum of independent terms as in \eqref{eq:is_est}. We resort to the central limit theorem and the delta method which say the following.

Central limit theorem. Given iid random vectors \(v_n\) (\(n = 1, 2, \dotsc\)) on \(\mathbb R^D\) with a common mean \(\mu \in \mathbb R^D\) and covariance \(\Sigma \in \mathbb R^{D \times D}\), \begin{align} \sqrt{N} \left(\frac{1}{N} \sum_{n = 1}^N v_n - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} where \(\xrightarrow{d}\) denotes convergence in distribution and \(\mathrm{Normal}(0, \Sigma)\) denotes a multivariate normal distribution with the covariance \(\Sigma\). \(\triangle\)

Delta method. Given \(\mu \in \mathbb R^D\) and its estimator \(\hat \mu_N\) with the following convergence \begin{align} \sqrt{N} \left(\hat \mu_N - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} then, given a function \(g: \mathbb R^D \to \mathbb R\) and the corresponding gradient \(\nabla g: \mathbb R^D \to \mathbb R^D\), we have \begin{align} \sqrt{N} \left(g(\hat \mu_N) - g(\mu)\right) \xrightarrow{d} \mathrm{Normal}\left(0, \nabla g(\mu)^T \Sigma \nabla g(\mu)\right). \end{align} \(\triangle\)

We can apply these theorems directly to self-normalized importance sampling, where \begin{align} v_n &= \begin{bmatrix} W_n f(X_n) \\
W_n \end{bmatrix}, \\
g\left(\begin{bmatrix} a \\
b \end{bmatrix}\right) &= \frac{a}{b}, \end{align} so that \begin{align} \mu &= \begin{bmatrix} ZI \\
Z \end{bmatrix}, \\
\nabla g\left(\begin{bmatrix} a \\
b \end{bmatrix}\right) &= \begin{bmatrix} 1 / a \\
-a / b^2 \end{bmatrix}, \\
\Sigma &= \begin{bmatrix} \Var(W_n f(X_n)) & \Cov(W_n, W_n f(X_n)) \\
\Cov(W_n, W_n f(X_n)) & \Var(W_n) \end{bmatrix}. \end{align} This results in the following convergence for the self-normalized importance sampling estimator: \begin{align} \sqrt{N} (\tilde I_N^{\text{IS}} - I) \xrightarrow{d} \mathrm{Normal}\left(0, \tilde \sigma_{\text{asymptotic}}^2\right), \end{align} where the asymptotic variance is \begin{align} \tilde \sigma_{\text{asymptotic}}^2 = \int \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}

Optimal Proposal

A proposal that minimizes the asymptotic variance \(\tilde \sigma_{\text{asymptotic}}^2\) has the following form \begin{align} \tilde q^{\text{opt}}(x) &= \frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx}. \end{align}

To see this, we can show that the asymptotic variance associated with \(\tilde q^{\text{opt}}\) is not greater than the asymptotic variance associated with any other \(q\): \begin{align} \int \frac{p(x)^2 (f(x) - I)^2}{\tilde q^{\text{opt}}(x)} \,\mathrm dx &= \int \frac{p(x)^2 (f(x) - I)^2}{\frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx}} \,\mathrm dx \\
&= \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \cdot \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \\
&= \E_q\left[\frac{\pi}{q} |f - I|\right]^2 && \text{(Jensen’s ineq.)} \\
&\leq E_q\left[\frac{\pi^2 |f - I|^2}{q^2}\right] \\
&= \int \frac{\pi(x)^2 |f(x) - I|^2}{q(x)} \,\mathrm dx \\
&= \int \frac{\pi(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}

[back]

Tuan Anh Le

Importance Sampling

Monte Carlo Estimator

Importance Sampling

Self-Normalized Importance Sampling

Consistency of The Estimator

Asymptotic Variance

Optimal Proposal