Simulation Methods: Monte Carlo
Motivation
Simulation methods are often used in Bayesian inference to compute definite integrals and to estimate parameters.
In Bayesian reasoning, a model’s parameter \(\theta\) is a random variable with a distribution \(f(\theta)\) called the prior distribution. Given some data \(x_1, x_2, …, x_n\), we are often interested in the posterior density:
$$ f(\theta | x_1, x_2, …, x_n) = \dfrac{f(x_1, x_2, …, x_n|\theta) \cdot f(\theta)}{f(x_1, x_2, …, x_n)} $$
The posterior mean gives an estimate of the parameter \(\theta\), known as the Bayes estimate:
$$ \hat{\theta} = \int\theta \cdot f(\theta| x_1, …, x_n)d\theta $$
For a multi-dimensional parameter, \(\vec{\theta} = (\theta_1, \theta_2, …, \theta_p)^{\top}\), estimating a single component requires integrating out the remaining components, i.e. a high-dimensional integration. For example: $$ f(\theta_1 | x_1, …, x_n) = \int\int … \int f(\vec{\theta} | x_1, …, x_n)\,d\theta_2 … d\theta_p $$ Such integrals are often analytically intractable. Hence the need for alternative methods, e.g. methods based on statistical sampling (other numerical methods exist as well, such as Simpson’s rule).
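As a preview of how sampling sidesteps such integrals: when the posterior can be sampled from directly, the Bayes estimate is approximated by simply averaging the draws. Here is a minimal sketch, assuming a conjugate Beta-Binomial model whose prior hyperparameters and data are made up purely for illustration (they are not part of the material above):

import numpy as np

# Assumed illustrative model: prior theta ~ Beta(2, 2), and we observe
# k = 7 successes in m = 10 Bernoulli(theta) trials.
a, b = 2, 2
k, m = 7, 10

# Conjugacy gives the posterior in closed form, Beta(a + k, b + m - k),
# so we can draw from it directly.
theta_samples = np.random.beta(a + k, b + m - k, 10000)

# The Bayes estimate (the posterior mean) is approximated by the sample mean,
# with no explicit integration.
theta_hat = np.mean(theta_samples)
print(theta_hat)  # should be close to the exact posterior mean 9/14 ≈ 0.643

For a multi-dimensional \(\vec{\theta}\), the same idea applies componentwise: the estimate of \(\theta_1\) is just the mean of the \(\theta_1\)-coordinates of the draws, so no explicit marginalization integral is needed.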
Monte Carlo Integration
Given an integral of the following form, where \(x \in A\):
$$ I = \int_{A} h(x)f(x)dx $$
\(I\) can be estimated by noticing that it is the expectation of the function \(h(x)\) over a random variable \(x \sim f\), assuming \(f\) qualifies as a probability density function, i.e., \(f(x) \ge 0\) \(\forall x\) and \(\int_A f(x)\,dx = 1\). Effectively, $$ I = \mathbb{E}_{f}[h(x)] $$
\(I\) can then be approximated by drawing \(n\) samples \(x_1, …, x_n\) from \(f\) and computing the sample mean: $$ \hat{I} = \dfrac{1}{n} \sum_{i=1}^n h(x_i) $$
For large \(n\), \(\hat{I} \rightarrow I\) according to the law of large numbers.
We can also estimate the variance of this estimate (and the standard error by taking the square root): $$ Var(\hat{I}) = \dfrac{1}{n^2} \sum_{i=1}^n Var(h(x_i)) = \dfrac{1}{n} \cdot Var(h(x)) $$ because the \(h(x_i)\) are independent draws with the same variance. This common variance can be estimated with the sample variance: $$ \widehat{Var(h(x))} = \dfrac{\sum_{i=1}^n (h(x_i) - \hat{I})^2}{n-1} $$
Example: Let’s try a simple example, \(I = \int_0^1 e^x dx\) (which we know is \(e - 1 \approx 1.72\)), using Monte Carlo integration:
If we let \(x \sim Unif(0, 1)\), then \(f(x)\) is the uniform density over \([0, 1]\), i.e., \(f(x) = 1\) on that interval. So we can start by sampling 1000 values from this distribution.
import numpy as np

n = 1000  # sample size
x = np.random.uniform(0, 1, n)  # draw n samples from f, the Unif(0, 1) density

def h(x):
    # the integrand h(x) = e^x
    return np.exp(x)

def estimate(h_values):
    # Monte Carlo estimate: the sample mean of h(x_i)
    return np.mean(h_values)

def variance(n, h_values, I_hat):
    # sample variance of h(x), divided by n to give Var(I_hat)
    var = (1 / (n - 1)) * np.sum(np.square(h_values - I_hat))
    return var / n

h_values = h(x)
I_hat = estimate(h_values)
var = variance(n, h_values, I_hat)
print("Estimated I: {}, Variance: {}".format(I_hat, var))
## Estimated I: 1.7399777725065875, Variance: 0.00024057715891182368
You can play around with the value of \(n\) in the code above and observe that, for larger \(n\), the estimate tends to the true value and the variance shrinks.
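To see this numerically, here is a small sketch that repeats the estimate for several values of \(n\) and reports it together with its standard error (the square root of the variance above); the helper name and the sample sizes are arbitrary choices for illustration:

import numpy as np

def mc_estimate(n):
    # one Monte Carlo run: n uniform samples, then the sample mean of e^x
    x = np.random.uniform(0, 1, n)
    hx = np.exp(x)
    I_hat = np.mean(hx)
    se = np.sqrt(np.var(hx, ddof=1) / n)  # standard error = sqrt(Var(I_hat))
    return I_hat, se

for n in [100, 1000, 10000, 100000]:
    I_hat, se = mc_estimate(n)
    print("n = {:>6}: estimate = {:.4f}, standard error = {:.4f}".format(n, I_hat, se))

The estimates should cluster around \(e - 1 \approx 1.718\), with the standard error shrinking roughly like \(1/\sqrt{n}\).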