17 changes: 5 additions & 12 deletions lectures/divergence_measures.md
@@ -4,7 +4,7 @@ jupytext:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.17.1
jupytext_version: 1.16.6
kernelspec:
display_name: Python 3 (ipykernel)
language: python
@@ -60,10 +60,6 @@ import pandas as pd
from IPython.display import display, Math
```





## Primer on entropy, cross-entropy, KL divergence

Before diving in, we'll introduce some useful concepts in a simple setting.
@@ -163,6 +159,8 @@ f(z; a, b) = \frac{\Gamma(a+b) z^{a-1} (1-z)^{b-1}}{\Gamma(a) \Gamma(b)}
\Gamma(p) := \int_{0}^{\infty} x^{p-1} e^{-x} dx
$$

We introduce two Beta distributions $f(x)$ and $g(x)$, which we will use to illustrate the different divergence measures.
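A minimal sketch of how such densities can be coded directly from the formula above (the parameter values here are assumptions for illustration; the lecture's own parameters are set in the next cell):

```python
import numpy as np
from math import gamma

def beta_pdf(z, a, b):
    # Beta(a, b) density: Γ(a+b) z^(a-1) (1-z)^(b-1) / (Γ(a) Γ(b))
    return gamma(a + b) * z**(a - 1) * (1 - z)**(b - 1) / (gamma(a) * gamma(b))

# two illustrative Beta densities (the parameter values here are assumed)
f = lambda z: beta_pdf(z, 1, 1)      # Beta(1, 1) is the uniform density on (0, 1)
g = lambda z: beta_pdf(z, 3, 1.2)

z = np.linspace(1e-5, 1 - 1e-5, 1000)   # grid for plotting or quadrature
print(f(z[:3]), g(z[:3]))
```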

Let's define parameters and density functions in Python

```{code-cell} ipython3
@@ -198,8 +196,6 @@ plt.legend()
plt.show()
```



(rel_entropy)=
## Kullback–Leibler divergence

@@ -457,15 +453,14 @@

We now generate plots illustrating how the overlap between the two distributions shrinks as the divergence measures increase.


```{code-cell} ipython3
param_grid = [
((1, 1), (1, 1)),
((1, 1), (1.5, 1.2)),
((1, 1), (2, 1.5)),
((1, 1), (3, 1.2)),
((1, 1), (5, 1)),
((1, 1), (0.3, 0.3))
((1, 1), (0.3, 0.3)),
((1, 1), (5, 1))
]
```

@@ -512,8 +507,6 @@ def plot_dist_diff(para_grid):
divergence_data = plot_dist_diff(param_grid)
```
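The body of `plot_dist_diff` is collapsed in this diff. As a hedged sketch of the kind of computation behind such a figure, one can walk through `param_grid` and report a divergence measure for each pair of densities (numerical quadrature is an illustrative choice here, not necessarily the lecture's implementation):

```python
import numpy as np
from math import gamma
from scipy.integrate import quad

def beta_pdf(z, a, b):
    return gamma(a + b) * z**(a - 1) * (1 - z)**(b - 1) / (gamma(a) * gamma(b))

def kl_div(p, q):
    # KL(p || q) = ∫ p(z) log(p(z) / q(z)) dz over (0, 1)
    val, _ = quad(lambda z: p(z) * np.log(p(z) / q(z)), 1e-8, 1 - 1e-8)
    return val

for (a1, b1), (a2, b2) in param_grid:   # param_grid from the cell above
    p = lambda z, a=a1, b=b1: beta_pdf(z, a, b)
    q = lambda z, a=a2, b=b2: beta_pdf(z, a, b)
    print(f"Beta({a1},{b1}) vs Beta({a2},{b2}):  KL(p||q) = {kl_div(p, q):.4f}")
```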



## KL divergence and maximum-likelihood estimation
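The body of this section is collapsed in the diff. As a standalone, hedged illustration of the standard connection (maximizing the average log likelihood of a sample is equivalent, up to a constant that does not depend on the model, to minimizing the KL divergence from the data-generating distribution to the model), with assumed parameter values:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
data = beta.rvs(3, 1.2, size=5_000, random_state=rng)   # sample from an assumed "true" Beta

# Maximum-likelihood fit of a Beta model: maximizing the average log likelihood
# of `data` minimizes a sample estimate of KL(true || model) up to a constant.
a_hat, b_hat, _, _ = beta.fit(data, floc=0, fscale=1)
print(a_hat, b_hat)   # close to the assumed (3, 1.2)
```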


39 changes: 23 additions & 16 deletions lectures/imp_sample.md
@@ -36,7 +36,7 @@ import matplotlib.pyplot as plt
from math import gamma
```

## Mathematical Expectation of Likelihood Ratio
## Mathematical expectation of likelihood ratio

In {doc}`this lecture <likelihood_ratio_process>`, we studied a likelihood ratio $\ell \left(\omega_t\right)$

@@ -57,11 +57,10 @@ $$
Our goal is to approximate the mathematical expectation $E \left[ L\left(\omega^t\right) \right]$ well.

In {doc}`this lecture <likelihood_ratio_process>`, we showed that $E \left[ L\left(\omega^t\right) \right]$ equals $1$ for all $t$.

We want to check how well this holds if we replace $E$ with sample averages from simulations.

This turns out to be easier said than done because for
Beta distributions assumed above, $L\left(\omega^t\right)$ has
a very skewed distribution with a very long tail as $t \rightarrow \infty$.
This turns out to be easier said than done because, for the Beta distributions assumed above, $L\left(\omega^t\right)$ has a very skewed distribution with a very long tail as $t \rightarrow \infty$.

This property makes it difficult to estimate the mean efficiently and accurately with standard Monte Carlo simulation methods.
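A minimal sketch of the difficulty (taking $f = Beta(1,1)$ and $g = Beta(3,1.2)$ as assumed parameter values): even though $E\left[L(\omega^t)\right] = 1$, the sample average of $L$ computed from draws out of $g$ tends to fall well short of $1$ once $t$ is moderately large.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1234)
T, N = 20, 10_000                         # horizon and number of simulated paths (assumed)
omega = rng.beta(3, 1.2, size=(N, T))     # draws from g = Beta(3, 1.2) (assumed parameters)

# one-period likelihood ratios ℓ(ω) = f(ω)/g(ω) with f = Beta(1, 1)
ell = beta.pdf(omega, 1, 1) / beta.pdf(omega, 3, 1.2)
L = ell.prod(axis=1)                      # cumulative likelihood ratio L(ω^T)

print(L.mean())   # the theoretical value is 1, but the sample average typically falls short
```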

@@ -156,7 +155,7 @@
E^g\left[\ell\left(\omega\right)\right] = \int_\Omega \ell(\omega) g(\omega) d\omega = \int_\Omega \ell(\omega) \frac{g(\omega)}{h(\omega)} h(\omega) d\omega = E^h\left[\ell\left(\omega\right) \frac{g(\omega)}{h(\omega)}\right]
$$
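A quick numerical check of this identity (a hedged sketch; the Beta parameters are assumptions carried over from above): drawing from $g$ and averaging $\ell(\omega)$, or drawing from $h$ and averaging $\ell(\omega) g(\omega)/h(\omega)$, estimates the same expectation.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
ell = lambda w: beta.pdf(w, 1, 1) / beta.pdf(w, 3, 1.2)     # assumed f and g

w_g = rng.beta(3, 1.2, size=100_000)       # draws from g
w_h = rng.beta(0.5, 0.5, size=100_000)     # draws from h

print(np.mean(ell(w_g)))                                                     # E^g[ℓ(ω)]
print(np.mean(ell(w_h) * beta.pdf(w_h, 3, 1.2) / beta.pdf(w_h, 0.5, 0.5)))   # E^h[ℓ(ω) g(ω)/h(ω)]
```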

## Selecting a Sampling Distribution
## Selecting a sampling distribution

Since we must use an $h$ that puts relatively more mass on the parts of the domain where $g$ puts low mass, we use $h=Beta(0.5, 0.5)$ as our importance distribution.

@@ -178,7 +177,7 @@ plt.ylim([0., 3.])
plt.show()
```
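One way to see why this choice helps (a hedged check under the assumed parameters): the one-period importance-sampling integrand $f(\omega)/h(\omega)$ stays bounded on $(0,1)$, whereas the plain Monte Carlo integrand $f(\omega)/g(\omega)$ blows up as $\omega \to 0$.

```python
import numpy as np
from scipy.stats import beta

w = np.linspace(1e-6, 1 - 1e-6, 100_000)
f_w = beta.pdf(w, 1, 1)        # assumed f = Beta(1, 1)
g_w = beta.pdf(w, 3, 1.2)      # assumed g = Beta(3, 1.2)
h_w = beta.pdf(w, 0.5, 0.5)    # importance distribution h = Beta(0.5, 0.5)

print((f_w / h_w).max())   # bounded (roughly π/2), so the importance weights stay controlled
print((f_w / g_w).max())   # explodes as ω → 0, which is what plagues plain Monte Carlo
```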

## Approximating a Cumulative Likelihood Ratio
## Approximating a cumulative likelihood ratio

We now study how to use importance sampling to approximate
${E} \left[L(\omega^T)\right] = {E} \left[\prod_{i=1}^T \ell \left(\omega_i\right)\right]$.
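The definition of the `estimate` function used in the cells below is collapsed in this diff; a hedged sketch of an estimator with a similar signature (the names, the fixed $f = Beta(1,1)$, and the vectorization strategy are assumptions) might look like:

```python
import numpy as np
from scipy.stats import beta

def estimate_sketch(g_a, g_b, q_a, q_b, T=1, N=10000, seed=0):
    """Estimate E[L(ω^T)] from N paths of length T drawn from Beta(q_a, q_b).

    Each period contributes ℓ(ω) g(ω)/q(ω), so q = g recovers plain Monte
    Carlo while q = h gives an importance sampling estimate.  Taking
    f = Beta(1, 1) is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    w = rng.beta(q_a, q_b, size=(N, T))
    f_w = beta.pdf(w, 1, 1)
    g_w = beta.pdf(w, g_a, g_b)
    q_w = beta.pdf(w, q_a, q_b)
    summand = (f_w / g_w) * (g_w / q_w)     # ℓ(ω) · g(ω)/q(ω), i.e. f(ω)/q(ω)
    return summand.prod(axis=1).mean()
```

On this reading, the calls to `estimate` below correspond to setting $q = g$ (plain Monte Carlo) or $q = h$ (importance sampling).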
@@ -233,7 +232,7 @@ For our importance sampling estimate, we set $q = h$.
estimate(g_a, g_b, h_a, h_b, T=1, N=10000)
```

Evidently, even at T=1, our importance sampling estimate is closer to $1$ than is the Monte Carlo estimate.
Evidently, even at $T=1$, our importance sampling estimate is closer to $1$ than is the Monte Carlo estimate.

Bigger differences arise when computing expectations over longer sequences, $E_0\left[L\left(\omega^t\right)\right]$.

@@ -248,7 +247,17 @@ estimate(g_a, g_b, g_a, g_b, T=10, N=10000)
estimate(g_a, g_b, h_a, h_b, T=10, N=10000)
```

## Distribution of Sample Mean
The Monte Carlo method underestimates because the likelihood ratio $L(\omega^T) = \prod_{t=1}^T \frac{f(\omega_t)}{g(\omega_t)}$ has a highly skewed distribution under $g$.

Most samples from $g$ produce small likelihood ratios, while the true mean requires occasional very large values that are rarely sampled.

In our case, since $g(\omega) \to 0$ as $\omega \to 0$ while $f(\omega)$ remains constant, the Monte Carlo procedure undersamples precisely where the likelihood ratio $\frac{f(\omega)}{g(\omega)}$ is largest.

As $T$ increases, this problem worsens exponentially, making standard Monte Carlo increasingly unreliable.

Importance sampling with $q = h$ fixes this by sampling more uniformly from regions important to both $f$ and $g$.
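A hedged sketch that quantifies this skewness under the assumed parameters ($f = Beta(1,1)$, $g = Beta(3,1.2)$): the median of $L(\omega^T)$ under $g$ is far below its mean, and a tiny fraction of the draws accounts for most of the sample average.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)
T, N = 10, 100_000
omega = rng.beta(3, 1.2, size=(N, T))                                  # draws from g
L = (beta.pdf(omega, 1, 1) / beta.pdf(omega, 3, 1.2)).prod(axis=1)     # L(ω^T)

top = np.sort(L)[-N // 100:]          # the largest 1% of the simulated ratios
print(f"mean   = {L.mean():.3f}  (theoretical value 1)")
print(f"median = {np.median(L):.3g}")
print(f"share of the sample mean contributed by the top 1%: {top.sum() / L.sum():.2%}")
```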

## Distribution of sample mean

We next study the bias and efficiency of the Monte Carlo and importance sampling approaches.

@@ -323,17 +332,15 @@ The simulation exercises above show that the importance sampling estimates are u

Evidently, the bias increases as $T$ increases.

## Choosing a Sampling Distribution
## Choosing a sampling distribution

+++

Above, we arbitrarily chose $h = Beta(0.5,0.5)$ as the importance distribution.

Is there an optimal importance distribution?

In our particular case, since we know in advance that $E_0 \left[ L\left(\omega^t\right) \right] = 1$.

We can use that knowledge to our advantage.
In our particular case, since we know in advance that $E_0 \left[ L\left(\omega^t\right) \right] = 1$, we can use that knowledge to our advantage.

Thus, suppose that we simply use $h = f$.
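To see why this is a natural candidate, note that with sampling distribution $h$ each draw contributes $L\left(\omega^T\right) \prod_{t=1}^T \frac{g(\omega_t)}{h(\omega_t)}$, so setting $h = f$ makes every draw contribute

$$
\prod_{t=1}^T \frac{f(\omega_t)}{g(\omega_t)} \cdot \frac{g(\omega_t)}{f(\omega_t)} = 1 ,
$$

an estimator with zero variance.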

@@ -364,10 +371,10 @@ b_list = [0.5, 1.2, 5.]
```{code-cell} ipython3
w_range = np.linspace(1e-5, 1-1e-5, 1000)

plt.plot(w_range, g(w_range), label=f'p=Beta({g_a}, {g_b})')
plt.plot(w_range, p(w_range, a_list[0], b_list[0]), label=f'g=Beta({a_list[0]}, {b_list[0]})')
plt.plot(w_range, p(w_range, a_list[1], b_list[1]), label=f'g=Beta({a_list[1]}, {b_list[1]})')
plt.plot(w_range, p(w_range, a_list[2], b_list[2]), label=f'g=Beta({a_list[2]}, {b_list[2]})')
plt.plot(w_range, g(w_range), label=f'g=Beta({g_a}, {g_b})')
plt.plot(w_range, p(w_range, a_list[0], b_list[0]), label=f'$h_1$=Beta({a_list[0]},{b_list[0]})')
plt.plot(w_range, p(w_range, a_list[1], b_list[1]), label=f'$h_2$=Beta({a_list[1]},{b_list[1]})')
plt.plot(w_range, p(w_range, a_list[2], b_list[2]), label=f'$h_3$=Beta({a_list[2]},{b_list[2]})')
plt.title('real data generating process $g$ and importance distribution $h$')
plt.legend()
plt.ylim([0., 3.])