MCMC

Introduction

Markov Chain Monte-Carlo (MCMC) is a family of methods for sampling from a distribution when it is difficult to sample from it directly. Combined with Monte-Carlo integration, it can be used for the numerical calculation of integrals.

It explores the state space $X$ using a Markov chain mechanism and produces samples $x_i$ that mimic samples drawn directly from the distribution $p(x)$. We assume that we can evaluate $p(x)$ up to a normalizing constant, but that it is difficult to draw samples from it.

Let us consider the situation when the number of possible states of the Markov chain is finite, say $n$. Then we can choose an initial distribution $q \in \mathbb{R}^n$ (a row vector) and a transition matrix $T \in \mathbb{R}^{n \times n}$. After the first step of the chain the distribution is $qT$, and after $k$ steps it is $qT^k$. If this sequence converges to a distribution $p(x)$ satisfying $pT = p$, we call $p(x)$ the invariant distribution of the chain. Convergence for any initial $q$ is guaranteed if the matrix $T$ is irreducible and aperiodic.
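As a small illustration (a minimal sketch in Python with NumPy; the three-state matrix is an arbitrary example, not taken from the text), repeatedly applying $T$ to an initial distribution $q$ converges to the invariant distribution:

import numpy as np

# An illustrative 3-state transition matrix (rows sum to 1);
# all entries positive, so it is irreducible and aperiodic.
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

q = np.array([1.0, 0.0, 0.0])  # arbitrary initial distribution
for _ in range(100):           # compute q T^k for k = 1, ..., 100
    q = q @ T

print("invariant distribution p ~", q)
print("check p T = p:", np.allclose(q @ T, q))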

Definition 1. A transition matrix $T$ is irreducible if, from any state of the Markov chain, there is a positive probability of reaching every other state.

Definition 2. A transition matrix $T$ is aperiodic if the Markov chain does not get trapped in cycles.

A sufficient (but not necessary) condition for $p(x)$ to be the invariant distribution of the chain is the reversibility (detailed balance) condition:

$\displaystyle{p(x_i) T(x_j|x_i) = p(x_j) T(x_i|x_j).}$

In this case $p(x)$ is the invariant distribution. Indeed, summing both sides over $x_j$ gives us

$\displaystyle{p(x_i) = \sum _{x_j} p(x_j) T(x_i|x_j).}$
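This can be checked numerically (a Python/NumPy sketch; the three-state target $p$ is an illustrative choice): we build a reversible $T$ with off-diagonal entries $T_{ij} = \frac{1}{n}\min(1, p_j/p_i)$, so that $p_i T_{ij} = \frac{1}{n}\min(p_i, p_j)$ is symmetric in $i, j$, and verify that detailed balance and invariance both hold:

import numpy as np

p = np.array([0.2, 0.3, 0.5])  # illustrative target distribution
n = len(p)

# Off-diagonal: T[i, j] = (1/n) * min(1, p[j]/p[i]); diagonal absorbs the rest.
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = (1.0 / n) * min(1.0, p[j] / p[i])
    T[i, i] = 1.0 - T[i].sum()

M = p[:, None] * T                       # entries p_i T(x_j | x_i)
print("detailed balance:", np.allclose(M, M.T))
print("invariance p T = p:", np.allclose(p @ T, p))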

In continuous state spaces the transition matrix $T$ becomes an integral kernel $K$, and $p(x)$ becomes the corresponding eigenfunction: $\int p(x_i) K(x_j|x_i)\,dx_i = p(x_j)$.

MCMC samplers are constructed so that the desired $p(x)$ is the invariant distribution of the chain. The most widely used construction is the Metropolis-Hastings algorithm described below.

The Metropolis-Hastings algorithm

Let us have a proposal distribution $q(x^* | x)$ and a target invariant distribution $p(x)$. An MH step samples a candidate value $x^*$ from the proposal distribution. The Markov chain then moves to $x^*$ with acceptance probability $A(x,x^*) = \min\left\{1,\frac{p(x^*)\,q(x|x^*)}{p(x)\,q(x^*|x)}\right\}$, and otherwise it remains at $x$.

Initialize $x_0$.
FOR $i = 0, \ldots, N-1$:
    Sample $u \sim U_{[0,1]}$ from the uniform distribution.
    Sample $x^* \sim q(x^* | x_i)$ from the proposal distribution.
    If $u < A(x_i, x^*)$ then $x_{i+1} = x^*$, else $x_{i+1} = x_i$.
END FOR
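Below is a minimal runnable version of this loop in Python (NumPy). The unnormalized target density and the Gaussian random-walk step size are illustrative choices; since the random-walk proposal is symmetric, the proposal ratio in $A(x, x^*)$ cancels:

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    """Illustrative target density, known only up to a normalizing constant."""
    return np.exp(-0.5 * x**2) * (1.0 + np.sin(3.0 * x)**2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = x0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        x_star = x + step * rng.standard_normal()  # sample x* ~ q(x* | x)
        u = rng.uniform()                          # sample u ~ U[0, 1]
        # A(x, x*) = min(1, p(x*)/p(x)) for a symmetric proposal
        if u < min(1.0, p_unnorm(x_star) / p_unnorm(x)):
            x = x_star                             # accept the candidate
        samples[i] = x                             # else remain at x
    return samples

samples = metropolis_hastings(50_000)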

If the proposal is symmetric, $q(x^* | x) = q(x | x^*)$, the proposal ratio cancels and the MH algorithm reduces to the Metropolis algorithm with $A(x, x^*) = \min\{1, p(x^*)/p(x)\}$.

Integration using MCMC

To calculate the integral $\int f(x) p(x)dx$ one could sample from the uniform distribution and apply Monte-Carlo integration directly, but sampling from $p(x)$ with MCMC may lead to much better convergence. First one runs a burn-in stage, during which the Markov chain driven by the MH algorithm approaches sampling from a distribution close to $p(x)$. Sampling then continues, and the values $f(x_j)$ are accumulated. The resulting approximation of the integral is $\frac{1}{N_2} \sum_{j=N_1+1}^{N_1+N_2} f(x_j)$, where $N_1$ is the length of the burn-in stage and $N_2$ is the length of the sampling stage.
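Continuing the Python sketch above (the choice $f(x) = x^2$ and the lengths $N_1$, $N_2$ are illustrative; the sampler metropolis_hastings is the one defined earlier):

# Estimate E_p[f(X)] = ∫ f(x) p(x) dx: discard the first N1 samples
# (burn-in) and average f over the next N2 samples.
f = lambda x: x**2

N1, N2 = 10_000, 40_000               # burn-in and sampling lengths
chain = metropolis_hastings(N1 + N2)  # reuse the sampler defined above
estimate = f(chain[N1:]).mean()       # (1/N2) * sum_{j=N1+1}^{N1+N2} f(x_j)
print("E_p[f(X)] ~", estimate)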

MCMC is also used to solve optimization problems of the form $\hat x = \arg\max_{x \in X} p(x)$ by applying the so-called simulated annealing technique, in which the target distribution at iteration $i$ is replaced by $p^{1/T_i}(x)$ for a decreasing sequence of temperatures $T_i \to 0$.
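A minimal sketch of this idea in Python (the logarithmic cooling schedule, the proposal step, and the target, reused from the sampler above, are all illustrative choices):

import numpy as np

rng = np.random.default_rng(1)

def anneal(log_p, x0=0.0, n_steps=20_000, step=1.0):
    x, best = x0, x0
    for i in range(n_steps):
        T = max(1e-3, 1.0 / np.log(2.0 + i))   # slowly decreasing temperature T_i
        x_star = x + step * rng.standard_normal()
        # Metropolis step on p^(1/T): accept with min(1, (p(x*)/p(x))^(1/T)),
        # computed in log space for numerical stability.
        if np.log(rng.uniform()) < (log_p(x_star) - log_p(x)) / T:
            x = x_star
        if log_p(x) > log_p(best):             # keep the best point seen
            best = x
    return best

# Example: approximate the mode of the unnormalized density used above.
log_p = lambda x: -0.5 * x**2 + np.log(1.0 + np.sin(3.0 * x)**2)
print("approximate arg max:", anneal(log_p))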

Bibliography

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21(6):1087–1092 (1953).
  • Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109 (1970).
  • Andrieu, C., de Freitas, N., Doucet, A., Jordan, M.I. An Introduction to MCMC for Machine Learning. Machine Learning, 50:5–43 (2003).