2 Bayesian statistics 331-Week 2

2.3 The evidence for the model

The role of the marginal likelihood

The marginal likelihood, $m(y) = f(y)$ (the denominator in Bayes' theorem), is used by Bayesians for model comparison. It is also known as the evidence in favour of the model. For model selection, classical statisticians use measures of fit penalised by a measure of complexity; the marginal likelihood does this automatically, applying the principle of Occam's razor to penalise large models.

The marginal likelihood or evidence

\[
f(y \mid M) = m(y) = \int_{\Theta} f(y \mid \theta)\,\pi(\theta)\,d\theta
\]
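As a concrete sketch of this integral, the marginal likelihood can be approximated numerically. The numbers below are illustrative (a binomial likelihood with a flat Beta(1, 1) prior, chosen to match the coin example later in the notes); the exact answer in this case is $1/11$.

```python
from math import comb

# Illustrative example: y = 9 successes in n = 10 trials,
# flat Beta(1, 1) prior on the success probability theta.
n, y = 10, 9

def integrand(theta):
    likelihood = comb(n, y) * theta**y * (1 - theta)**(n - y)
    prior = 1.0  # Beta(1, 1) density is uniform on [0, 1]
    return likelihood * prior

# Midpoint rule on a fine grid over [0, 1]
N = 100_000
m_y = sum(integrand((i + 0.5) / N) for i in range(N)) / N
print(m_y)  # approximately 0.0909 = 1/11
```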

2.3.1 Evidence in favour of a model

The marginal likelihood

  1. Complex models with more parameters fit better and have a higher likelihood.

  2. However, complex models can predict poorly. Occam's razor applies to prediction.

  3. The marginal likelihood penalises the fit by a measure of complexity.

Figure 2.1: The dashed line represents the likelihood and the solid line the marginal likelihood. With increasing model complexity, the likelihood continually increases, whereas the marginal likelihood has a maximum at some point, after which it begins to decrease.

How the marginal likelihood works

Using Bayes' theorem, the posterior is

\[
p(\theta \mid y, M) = \frac{p(y \mid \theta, M)\,p(\theta \mid M)}{p(y \mid M)}
\]

Rearranging to make the marginal likelihood the subject gives

\[
p(y \mid M) = \frac{p(y \mid \theta, M)\,p(\theta \mid M)}{p(\theta \mid y, M)}
\]

Writing this in log form, we have

\[
\log p(y \mid M) = \underbrace{\log p(y \mid \theta, M)}_{\text{log-likelihood}} \;-\; \underbrace{\bigl(\log p(\theta \mid y, M) - \log p(\theta \mid M)\bigr)}_{\text{penalty}}
\]
\[
-2\log p(y \mid M) = \text{Deviance} + 2 \times \text{Penalty}
\]

This identity holds for any $\theta$, including the MLE. $\log p(y \mid \theta, M)$ increases with model complexity, but so does the penalty.
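The identity can be checked numerically. The sketch below uses the binomial-with-beta-prior model from the next subsection, with illustrative values $n = 10$, $y = 9$ and a flat Beta(1, 1) prior; the decomposition agrees with the directly computed log marginal likelihood at an arbitrary $\theta$.

```python
from math import comb, lgamma, log

# Check log m(y) = log p(y|theta) - [log p(theta|y) - log p(theta)]
# for a binomial likelihood with a Beta(p, q) prior (illustrative numbers).
n, y, p, q = 10, 9, 1, 1

def betaln(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_beta_pdf(theta, a, b):
    # log density of a Beta(a, b) distribution at theta
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - betaln(a, b)

# Direct marginal likelihood: the beta-binomial formula
log_m = log(comb(n, y)) + betaln(y + p, n - y + q) - betaln(p, q)

# The identity holds at ANY theta; try theta = 0.7
theta = 0.7
log_lik = log(comb(n, y)) + y * log(theta) + (n - y) * log(1 - theta)
penalty = log_beta_pdf(theta, y + p, n - y + q) - log_beta_pdf(theta, p, q)
print(log_m, log_lik - penalty)  # the two values agree
```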

2.3.2 Binomial observation and a beta prior


\[
m(y) = \binom{n}{y} \int_0^1 \pi^{y}(1-\pi)^{n-y}\,\frac{\pi^{p-1}(1-\pi)^{q-1}}{B(p,q)}\,d\pi
= \binom{n}{y}\frac{B(y+p,\; n-y+q)}{B(p,q)}
\]

This is the Beta-binomial distribution.

\[
y \sim \text{Beta-Binomial}(n, p, q)
\]

which has more variation than the binomial.

Figure 2.2

2.3.3 Poisson observation and a gamma prior


\[
\begin{aligned}
m(y) &= \int_0^\infty f(y \mid \theta)\,\pi(\theta)\,d\theta \\
&= \int_0^\infty \frac{1}{y!}e^{-\theta}\theta^{y}\,\frac{q^{p}}{\Gamma(p)}\theta^{p-1}e^{-q\theta}\,d\theta \\
&= \frac{q^{p}}{\Gamma(p)\,y!}\int_0^\infty \theta^{y+p-1}e^{-\theta(q+1)}\,d\theta \\
&= \frac{q^{p}}{y!\,\Gamma(p)}\,\frac{\Gamma(y+p)}{(q+1)^{y+p}} \\
&= \frac{\Gamma(y+p)}{y!\,\Gamma(p)}\left(\frac{q}{q+1}\right)^{p}\left(\frac{1}{q+1}\right)^{y}
\end{aligned}
\]
so
\[
y \sim \text{Negative-Binomial}\!\left(p,\; \frac{q}{q+1}\right)
\]

The long tail of the marginal likelihood

Figure 2.3: The diagram compares the Poisson and the negative-binomial distributions. The negative-binomial can have more variation than the Poisson.
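As a quick sanity check of the derivation above, the closed-form marginal can be evaluated directly; with illustrative shape $p = 3$ and rate $q = 2$ it sums to one over $y = 0, 1, 2, \dots$ and has mean $E[y] = E[\theta] = p/q = 1.5$, as a proper negative-binomial pmf should.

```python
from math import lgamma, exp, log

# The Poisson-Gamma marginal in closed form:
# m(y) = Gamma(y+p) / (y! Gamma(p)) * (q/(q+1))^p * (1/(q+1))^y
# (illustrative shape p = 3, rate q = 2).
p, q = 3.0, 2.0

def marginal(y):
    return exp(lgamma(y + p) - lgamma(y + 1) - lgamma(p)
               + p * log(q / (q + 1)) + y * log(1 / (q + 1)))

# Summing over a long grid of y values; the tail beyond y = 200 is negligible.
total = sum(marginal(y) for y in range(200))
mean_y = sum(y * marginal(y) for y in range(200))
print(total, mean_y)  # total is ~1; the mean matches p/q = 1.5
```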

The marginal likelihood and Bayes factors

  1. Bayesian inference is obtained by updating our belief in a parameter: we multiply our current belief by the likelihood of the current observation(s) and then normalise. This yields the posterior distribution, which can become the prior for the next observation.

  2. Updating belief in a model is, in principle, no different from updating belief in a parameter.

  3. The marginal likelihood is the denominator in Bayes' theorem.

  4. The ratio of marginal likelihoods gives a Bayes factor, which compares the weight of evidence in favour of two given models:

     \[
     BF_{2,1} = \frac{p(y \mid M_2)}{p(y \mid M_1)}
     \]

  5. Bayesian model selection is coherent and uses only the laws of probability.

Example

A coin is tossed 10 times, resulting in 9 heads and one tail. The question of interest is whether the coin is biased. The null hypothesis, $H_0$, is that the coin is fair ($\pi = 1/2$); the alternative hypothesis, $H_a$, is that the coin is biased ($\pi \neq 1/2$).

  (i) Define the p-value used in classical hypothesis testing. What is the implication of a low p-value?

  (ii) Use a classical hypothesis test to test whether the coin is fair at a 5% significance level.

  (iii) Assuming (before the experiment) that each hypothesis is equally likely, calculate the Bayes factor for $H_a$ relative to $H_0$ (if necessary, state any further assumptions you need to make).

  (iv) What is the probability that the coin is biased?

Answer

  (i) A p-value is the probability of obtaining a test statistic at least as extreme as that observed, under the null hypothesis, by pure chance. A low p-value is evidence against the null hypothesis.

  (ii) $p(X \in \{9, 10\}) = 0.5^{10} + 10 \times 0.5^{10} = 0.01074$. The test is two-tailed, so the p-value is $2 \times 0.01074 \approx 0.02$. Since $p < 0.05$, reject $H_0$ in favour of $H_a$ at the 5% significance level.

  (iii) Assume that under $H_a$, $\pi \sim \text{Beta}(\alpha, \beta)$ with $\alpha = \beta = 1$. The normalising constant for the binomial is $k = \binom{10}{9} = 10$. Then

    \[
    \frac{p(y \mid H_0)}{p(y \mid H_a)} = \frac{k \times 0.5^{10}}{\int p(y \mid \pi)\,p(\pi)\,d\pi} = \frac{k \times 0.5^{10}}{k \times B(y+\alpha,\; n-y+\beta)} = 0.1074
    \]

Answer

  (iv) With $P(H_0) = P(H_a)$, the posterior odds equal the Bayes factor:

    \[
    \frac{1 - p(H_a \mid y)}{p(H_a \mid y)} = 0.1074
    \]

    so

    \[
    p(H_a \mid y) = \frac{1}{1 + 0.1074} = 0.903 \quad [4]
    \]
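The whole worked example can be reproduced in a few lines. The sketch below computes both marginal likelihoods, the Bayes factor of $H_0$ against $H_a$, and the posterior probability of bias under equal prior model probabilities.

```python
from math import comb, lgamma, exp

# Coin example: n = 10 tosses, y = 9 heads.
# H0: pi = 1/2 versus Ha: pi ~ Beta(1, 1).
n, y, alpha, beta = 10, 9, 1, 1

def betaln(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

m_H0 = comb(n, y) * 0.5 ** n                  # p(y | H0)
m_Ha = comb(n, y) * exp(betaln(y + alpha, n - y + beta)
                        - betaln(alpha, beta))  # p(y | Ha), beta-binomial

bf_0a = m_H0 / m_Ha            # Bayes factor of H0 against Ha
p_Ha = 1 / (1 + bf_0a)         # posterior P(Ha | y) with equal prior odds
print(round(bf_0a, 4), round(p_Ha, 3))  # 0.1074 0.903
```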