Simulation Methods: Importance Sampling
The previous post motivated the need for simulation methods and introduced the basic Monte Carlo method. Here we’ll discuss another technique, namely importance sampling.
Importance Sampling
As in the previous post, suppose we are given an integral of the form below, where \(f(x)\) is a probability density function and \(x \in A\):
\begin{equation} I = \int_{A} h(x)f(x)dx \tag{1} \label{eq:1} \end{equation}
Now, suppose we can’t (or do not know how to) sample from \(f(x)\). In such cases the basic Monte Carlo method falls apart.
One trick we can use is to introduce a probability density function we can sample from. Call it \(q(x)\), require \(q(x) > 0\) wherever \(h(x)f(x) \neq 0\), and rearrange Equation \(\eqref{eq:1}\) above as follows:
\begin{equation} I = \int_{A} \dfrac{h(x)f(x)}{q(x)} \cdot q(x)dx \end{equation}
\begin{equation} I = \mathbb{E}_{q}\biggl[\dfrac{h(x)f(x)}{q(x)}\biggr] \end{equation}
We can estimate this expectation by sampling \(n\) values \(x_1, \ldots, x_n\) from \(q(x)\) (i.e., \(x_i \sim q\) for all \(i \in \{1, \ldots, n\}\)) and computing the mean:
$$ \hat{I} = \dfrac{1}{n} \cdot \sum_{i=1}^{n} \dfrac{h(x_i)f(x_i)}{q(x_i)} \tag{2} \label{eq:2} $$
By the law of large numbers, \(\hat{I} \rightarrow I\) as \(n \rightarrow \infty\).
This trick is what is known as Importance Sampling.
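To make Equation \eqref{eq:2} concrete, here is a minimal sketch in Python (the post itself contains no code, so NumPy, SciPy, and the particular densities below are illustrative assumptions on my part). It estimates \(I = \mathbb{E}_f[h(X)]\) for \(h(x) = x^2\) with \(f\) a standard normal, so the true value is \(\mathbb{E}[X^2] = 1\), using a wider normal as the proposal \(q\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Target density f (standard normal) and integrand h(x) = x^2.
# The true value is I = E_f[X^2] = Var(X) = 1.
f = stats.norm(loc=0, scale=1)

def h(x):
    return x ** 2

# Proposal density q that we *can* sample from: a wider normal.
q = stats.norm(loc=0, scale=2)

# Equation (2): I_hat = (1/n) * sum_i h(x_i) f(x_i) / q(x_i), with x_i ~ q.
x = q.rvs(size=n, random_state=rng)
I_hat = np.mean(h(x) * f.pdf(x) / q.pdf(x))
print(I_hat)  # close to 1.0 for large n
```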
So how do you choose \(q(x)\)?
Consider the variance of \(\hat{I}\),
$$ Var(\hat{I}) = \frac{1}{n} \cdot Var_{q}\biggl(\dfrac{h(x)f(x)}{q(x)}\biggr) \tag{3} \label{eq:3} $$
If \(q(x) \ll h(x)f(x)\) for some values of \(x\), the ratio \(\dfrac{h(x)f(x)}{q(x)}\) blows up there and the variance in Equation \eqref{eq:3} becomes large, whereas we would ideally like an estimate \(\hat{I}\) with a small variance. Conversely, if \(q(x) \gg h(x)f(x)\) for some values of \(x\), then from Equation \eqref{eq:2} the ratio \(\dfrac{h(x_i)f(x_i)}{q(x_i)}\) will be close to zero, and those samples contribute almost nothing to the summation. Therefore \(q\) is generally chosen to be similar in shape to \(h(x)f(x)\), with most of its mass where \(h(x)f(x)\) is large.
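To see both failure modes, the sketch below (same hypothetical setup as before) compares the empirical mean and variance of the summands \(\dfrac{h(x_i)f(x_i)}{q(x_i)}\) under a proposal whose mass covers \(h(x)f(x)\) against one concentrated in the wrong region:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

f = stats.norm(0, 1)   # target density

def h(x):
    return x ** 2      # integrand; the true value of I is 1

def summand_stats(q):
    """Empirical mean and variance of h(x) f(x) / q(x) with x ~ q."""
    x = q.rvs(size=n, random_state=rng)
    terms = h(x) * f.pdf(x) / q.pdf(x)
    return terms.mean(), terms.var()

# Well-matched proposal: mass roughly where h(x) f(x) is large,
# so the mean lands near 1 with modest variance.
print(summand_stats(stats.norm(0, 2)))

# Badly matched proposal: most weights are nearly zero, while the
# rare draws that land where h(x) f(x) matters carry enormous
# weights, so the estimate is erratic and the variance explodes.
print(summand_stats(stats.norm(6, 1)))
```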
One could also minimize Equation \eqref{eq:3} directly and obtain the density that minimizes the variance; (Wasserman 2004) provides a proof for the optimal choice of \(q\). However, it is generally very difficult to sample from this optimal distribution in practice.
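Concretely, the variance-minimizing proposal turns out to be proportional to \(|h(x)|f(x)\):

$$ q^{*}(x) = \dfrac{|h(x)|\,f(x)}{\int_{A} |h(s)|\,f(s)\,ds} $$

Note that the denominator is essentially the quantity we set out to compute in the first place (it equals \(I\) exactly when \(h \geq 0\)), which is what makes sampling from \(q^{*}\) impractical.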
References
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-0-387-21736-9.