Markov chain: Introduction
The Metropolis-Hastings algorithm is a Markov chain Monte Carlo (MCMC) method. Roughly speaking, MCMC uses a Markov chain to generate samples from a distribution of interest. To understand why the Metropolis-Hastings algorithm works, we therefore have to talk about Markov chains.
In this post, we introduce the Markov chain and define the stationary and limiting distributions, using an example inspired by those in (Fewster 2014).
Markov chain
A sequence of random variables \(\{X_0, X_1, X_2,\cdots\}\) is called a Markov chain if it satisfies the Markov property:
\begin{equation} P(X_t = s_t | X_{t-1} = s_{t-1}, \cdots, X_1 = s_1, X_0 = s_0) = P(X_t = s_t | X_{t-1} = s_{t-1}) \tag{1} \label{eq:1} \end{equation}
for all \(t \in \{1, 2, 3, \cdots\} \) and all states \(s_0, s_1, \cdots, s_t\) in some set \(\mathcal{S}\) called the state space.
In this series we are interested in discrete-time (since \(t\) is discrete) homogeneous Markov chains, i.e., chains with the following property:
\begin{equation} P(X_t = j | X_{t-1} = i) = P(X_1 = j | X_0 = i) \tag{2} \label{eq:2} \end{equation} for all \(t \in \{1, 2, 3, \cdots\}\)
For all state pairs \(\{(i, j) \mid i, j \in \mathcal{S}\}\) we can therefore define the transition probability from state \(i\) to state \(j\), denoted \(p_{ij}\):
\begin{equation} p_{ij} = P(X_{t} = j | X_{t-1} = i) \tag{3} \label{eq:3} \end{equation}
The matrix \(\mathbf{P}\) (an \(|\mathcal{S}| \times |\mathcal{S}|\) matrix) whose \((i, j)\) element is \(p_{ij}\) is called the transition matrix (depicted below).
Note here that for \(\mathbf{P}\) to be defined this way, \(\mathcal{S}\) has to be finite. Additionally, since row \(i\) is the conditional distribution of the next state given the current state \(i\), the rows of the matrix sum to 1, i.e., \(\sum_{j} p_{ij} = 1\).
Example: A transition matrix of a Markov chain is given below:
$$ \mathbf{P} = \begin{matrix} & \begin{matrix} A & B & C \end{matrix} \\ \begin{matrix} A \\ B \\ C \end{matrix} & \begin{pmatrix} 0.2 & 0.6 & 0.2 \\ 0.4 & 0.3 & 0.3 \\ 0.1 & 0.2 & 0.7 \\ \end{pmatrix} \\ \end{matrix} $$
We let \(\mathcal{S}\) be the set \(\{A, B, C\}\). The corresponding transition diagram is visualized below.
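To make this concrete, here is a minimal sketch (using numpy; the state order \(A, B, C\) matches the matrix above) that encodes \(\mathbf{P}\), checks that every row sums to 1, and samples a short trajectory of the chain:
import numpy as np
# Transition matrix, rows/columns ordered as (A, B, C).
P = np.array([[0.2, 0.6, 0.2],
              [0.4, 0.3, 0.3],
              [0.1, 0.2, 0.7]])
# Row i is the conditional distribution of the next state given
# state i, so each row must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)
print(P[1, 2])  # p_BC = 0.3: probability of jumping from B to C
# Sample a short trajectory: the next state depends only on the
# current state, via the current state's row of P.
rng = np.random.default_rng(0)
states = ["A", "B", "C"]
x = 0  # start in state A
trajectory = [states[x]]
for _ in range(5):
    x = rng.choice(3, p=P[x])
    trajectory.append(states[x])
print(trajectory)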
\(p_{ij}\) is the probability of jumping from state \(i\) to state \(j\) in one step. The probability of transitioning from state \(i\) to state \(j\) in \(t\) steps is written \(p_{ij}(t)\):
\begin{equation} p_{ij}(t) = P(X_{n+t} = j | X_{n} = i) = P(X_{t} = j | X_{0} = i) \tag{4} \label{eq:4} \end{equation}
The second equality above holds because a homogeneous chain is assumed, as mentioned before.
The \(t\)-step transition matrix \(\mathbf{P}_t\) can be computed directly from the single-step transition matrix \(\mathbf{P}\):
\begin{equation} \mathbf{P}_t = \mathbf{P}^{t} = \underbrace{\mathbf{P} \times \mathbf{P} \times \cdots \times \mathbf{P}}_{\text{t times}} \tag{5} \label{eq:5} \end{equation}
\begin{equation} \implies p_{ij}(t) = (\mathbf{P}_t)_{ij} \end{equation}
Equivalently,
\begin{equation} \mathbf{P}_{m + n} = \mathbf{P}_m \mathbf{P}_n \tag{6} \label{eq:6} \end{equation}
\begin{equation} \implies p_{ij}(m + n) = \sum_{k} p_{ik}(m)p_{kj}(n) \end{equation}
These equations are known as the Chapman-Kolmogorov equations. These results are proven in (Wasserman 2004).
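As a quick numerical check (a sketch reusing \(\mathbf{P}\) and numpy from the snippet above), we can verify equations (5) and (6) for, say, \(m = 2\) and \(n = 3\):
from numpy.linalg import matrix_power
m, n = 2, 3
P_m = matrix_power(P, m)  # 2-step transition matrix
P_n = matrix_power(P, n)  # 3-step transition matrix
# P_{m+n} = P_m P_n: transitioning in m + n steps is transitioning
# in m steps and then in n steps, summed over intermediate states.
assert np.allclose(matrix_power(P, m + n), P_m @ P_n)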
By convention,
\begin{equation} \mathbf{P}_0 = \mathbf{I}_n \end{equation} where \( n = |\mathcal{S}|\)
\begin{equation} \implies p_{ij}(0) = \delta_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases} \end{equation}
Additionally, the probability that the chain is in a particular state, say \(i\), at a particular time \(t\) is written: $$ \mu_{t}(i) = P(X_{t} = i) $$
So \(\vec{\mu}_{t} = (\mu_{t}(1), \cdots, \mu_{t}(N))\) represents the distribution of the chain at a particular time \(t\), given \(\mathcal{S} = \{1, \cdots, N\}\). We say \(X_t\) has distribution \(\vec{\mu}_{t}\), or mathematically, \(X_t \sim \vec{\mu}_{t}\).
If \(\vec{\mu}_{0}\) is known, \(\vec{\mu}_{t}\) can be calculated for any \(t \geq 1\):
\begin{equation} \vec{\mu}_{t}^{\top} = \vec{\mu}_{0}^{\top} \mathbf{P}_t = \vec{\mu}_{0}^{\top} \mathbf{P}^t \tag{7} \label{eq:7} \end{equation}
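Continuing the sketch from above, propagating an initial distribution forward in time is just a vector-matrix product (the uniform \(\vec{\mu}_0\) here is an arbitrary choice):
mu_0 = np.array([1 / 3, 1 / 3, 1 / 3])  # arbitrary initial distribution
mu_1 = mu_0 @ P                    # mu_1^T = mu_0^T P
mu_5 = mu_0 @ matrix_power(P, 5)   # mu_5^T = mu_0^T P^5
# Each mu_t remains a probability distribution.
assert np.allclose(mu_1.sum(), 1.0) and np.allclose(mu_5.sum(), 1.0)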
For the purposes of Markov chain Monte Carlo methods we are interested in the equilibrium/stationary distribution of the chain, often denoted \(\vec{\pi}\). If a Markov chain is at equilibrium at a particular time \(t\), then it stays at equilibrium forever, i.e., if \(X_{t} \sim \vec{\pi}\) then \(X_{t + n} \sim \vec{\pi}\) for any \(n \geq 0\).
Example: Taking the Markov chain with the transition matrix from the previous example and letting \(X_0 \sim \vec{\mu}_0 = (\frac{1}{3},\frac{1}{3},\frac{1}{3})\), we see rapid convergence (Figure 1).
From \(\vec{\mu}_{4}\) on, the distribution no longer changes visibly. The actual equilibrium distribution for this chain is \(\vec{\pi} = (0.2173913, 0.31884058, 0.46376812)^{\top}\); the code for computing it can be found in the Appendix.
We also observe in this example that the rows of \(\mathbf{P}^{t}\) converge to a limiting distribution, \(\lim_{t\to\infty} p_{ij}(t) = \pi_j\), which is exactly the stationary distribution. We can verify this numerically:
import numpy as np
from numpy.linalg import matrix_power
P = np.array([[0.2, 0.6, 0.2], [0.4, 0.3, 0.3], [0.1, 0.2, 0.7]])
pi = np.array([0.2173913, 0.31884058, 0.46376812])
# A 3 x 3 matrix whose rows all equal pi.
limiting_distribution = np.round(np.zeros((3, 3)) + pi, 8)
# Rounded to 8 decimal places, P^t equals the limiting distribution
# for any t >= 26.
P_26 = np.round(matrix_power(P, 26), 8)
P_200 = np.round(matrix_power(P, 200), 8)
P_300 = np.round(matrix_power(P, 300), 8)
np.testing.assert_array_equal(P_26, limiting_distribution)
np.testing.assert_array_equal(P_200, limiting_distribution)
np.testing.assert_array_equal(P_300, limiting_distribution)
This means that no matter which state we start in, i.e., no matter what \(X_0\) is, \(P(X_t = A | X_0) \to 0.2173913\), \(P(X_t = B | X_0) \to 0.31884058\), and \(P(X_t = C | X_0) \to 0.46376812\) as \(t \rightarrow \infty\). This is illustrated graphically below (Figure 2).
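A minimal numerical sketch of this independence from the starting state (reusing P, pi, and matrix_power from the snippets above): starting with certainty in \(A\) or with certainty in \(C\) yields the same distribution after enough steps, namely \(\vec{\pi}\).
start_at_A = np.array([1.0, 0.0, 0.0])  # X_0 = A with probability 1
start_at_C = np.array([0.0, 0.0, 1.0])  # X_0 = C with probability 1
P_100 = matrix_power(P, 100)
# Both starting points lead to (numerically) the same distribution: pi.
assert np.allclose(start_at_A @ P_100, start_at_C @ P_100)
assert np.allclose(start_at_A @ P_100, pi)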
For the Markov chain used as an example in this post, both the stationary distribution of the chain and the limiting distribution of the transition matrix exist. The next article discusses the conditions that guarantee the existence of the stationary and limiting distributions in general.
Appendix
Code to generate Figure 1
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rc
from numpy.linalg import matrix_power
sns.set_theme(style="white")
rc("text", usetex=True)
P = np.array([[0.2, 0.6, 0.2], [0.4, 0.3, 0.3], [0.1, 0.2, 0.7]])
mu_0 = np.array([1 / 3, 1 / 3, 1 / 3])
mu_1 = np.dot(mu_0, P)
mu_2 = np.dot(mu_1, P)
mu_3 = np.dot(mu_2, P)
mu_4 = np.dot(mu_3, P)
mu_5 = np.dot(mu_4, P)
mu_600 = np.dot(mu_0, matrix_power(P, 600))
mu_601 = np.dot(mu_0, matrix_power(P, 601))
mu_602 = np.dot(mu_0, matrix_power(P, 602))
mu_603 = np.dot(mu_0, matrix_power(P, 603))
mu_604 = np.dot(mu_0, matrix_power(P, 604))
mu_605 = np.dot(mu_0, matrix_power(P, 605))
states = ["A", "B", "C"]
data = {
"mu_0": mu_0,
"mu_1": mu_1,
"mu_2": mu_2,
"mu_3": mu_3,
"mu_4": mu_4,
"mu_5": mu_5,
"mu_600": mu_600,
"mu_601": mu_601,
"mu_602": mu_602,
"mu_603": mu_603,
"mu_604": mu_604,
"mu_605": mu_605,
}
df = pd.DataFrame(data, index=states)
_, axes = plt.subplots(2, 6)
axes = axes.flatten()
for index, key in enumerate(df.keys()):
    # Bar plot of the distribution mu_t over the three states.
    sns.barplot(
        x=states, y=data[key], color="salmon", saturation=0.5, ax=axes[index]
    ).set(title=r"$\vec{\mu}_{" + key.split("_")[1] + "}$")
plt.tight_layout()
Code to generate Figure 2
# d[i] holds the (i + 1)-step transition matrix P^(i + 1).
d = np.zeros((100, 3, 3))
for i in range(100):
    d[i] = matrix_power(P, i + 1)
# Row i of d corresponds to t = i + 1, so index the frames by t directly.
from_A = pd.DataFrame(d[:, 0], columns=["A", "B", "C"], index=range(1, 101)).reset_index()
from_B = pd.DataFrame(d[:, 1], columns=["A", "B", "C"], index=range(1, 101)).reset_index()
from_C = pd.DataFrame(d[:, 2], columns=["A", "B", "C"], index=range(1, 101)).reset_index()
data = {"from_A": from_A, "from_B": from_B, "from_C": from_C}
for key in data.keys():
data[key] = pd.melt(
data[key], id_vars=["index"], value_vars=["A", "B", "C"], var_name="state",
)
_, axes = plt.subplots(1, 3)
axes = axes.flatten()
for index, key in enumerate(data.keys()):
    # One line per target state s, showing p_{ij}(t) as a function of t.
    sns.lineplot(
        x="index", y="value", hue="state", data=data[key], ax=axes[index]
    ).set(xlabel="$t$", ylabel="$P(X_t = s | X_0 = " + key.split("_")[1] + ")$")
axes[index].legend().set_title("state, $s$")
plt.tight_layout()
Finding the stationary distribution
We are interested in the stationary distribution of the Markov chain in the example above.
When a Markov chain reaches equilibrium the following equation holds:
\begin{equation} \vec{\pi}^{\top} \mathbf{P} = \vec{\pi}^{\top} \tag{8} \label{eq:8} \end{equation}
This comes from the definition of the stationary distribution: when a Markov chain reaches equilibrium, it stays at equilibrium forever. \(\vec{\pi}^{\top}\mathbf{P}\) is the distribution one time step after equilibrium, and by definition this must again be \(\vec{\pi}^{\top}\) (in what follows, \(\mathbf{P}\) is the transition matrix from our running example).
We also know that: \begin{equation} \sum_{s} \pi_{s} = \pi_A + \pi_B + \pi_C = 1 \tag{9} \label{eq:9} \end{equation}
for \(s \in \mathcal{S}\). Therefore we have to solve the equations above for the value of \(\vec{\pi}\):
\begin{align} 0.2\pi_A + 0.4\pi_B + 0.1\pi_C &= \pi_A \\ 0.6\pi_A + 0.3\pi_B + 0.2\pi_C &= \pi_B \\ 0.2\pi_A + 0.3\pi_B + 0.7\pi_C &= \pi_C \\ \pi_A + \pi_B + \pi_C &= 1 \end{align}
These equations can be rewritten: \begin{align*} -0.8\pi_A + 0.4\pi_B + 0.1\pi_C &= 0 \\ 0.6\pi_A - 0.7\pi_B + 0.2\pi_C &= 0 \\ 0.2\pi_A + 0.3\pi_B - 0.3\pi_C &= 0 \\ \pi_A + \pi_B + \pi_C &= 1 \end{align*}
which is of the form \(\mathbf{A} \vec{\pi} = \vec{b}\). However, this is an overdetermined system of equations (there are more equations than unknowns). We can verify that the columns of \(\mathbf{A}\) are linearly independent (the matrix has rank 3) and use the Moore-Penrose pseudoinverse to solve the system in the least-squares sense:
$$ \vec{\pi} = (\mathbf{A}^{\top}\mathbf{A})^{-1}\mathbf{A}^{\top}\vec{b} $$
# Stack the stationarity conditions (P^T - I) pi = 0 and the
# normalization constraint pi_A + pi_B + pi_C = 1 into a 4 x 3 system.
A = np.append(P.T - np.identity(3), [[1, 1, 1]], axis=0)
assert np.linalg.matrix_rank(A) == 3
b = np.array([0, 0, 0, 1])
# Solve the normal equations (A^T A) pi = A^T b.
pi = np.linalg.solve(A.T.dot(A), A.T.dot(b))
print(pi)
# [0.2173913  0.31884058 0.46376812]
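As a final sanity check (a small addition to the original snippet), the computed \(\vec{\pi}\) indeed satisfies equations (8) and (9):
assert np.allclose(pi @ P, pi)    # eq. (8): pi^T P = pi^T
assert np.isclose(pi.sum(), 1.0)  # eq. (9): the entries sum to 1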
References
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York: Springer. https://doi.org/10.1007/978-0-387-21736-9.