1 Week 1: Bayesian inference

1.2 Conjugacy


A prior is conjugate for a given likelihood if both the prior and posterior have the same parametric form.

We will see in the next chapter that all likelihoods from the exponential family have conjugate priors. Here we look at a few examples.

1.2.1 Beta-binomial conjugacy

The binomial

(Binomial sample.) Suppose our likelihood model is $x \sim \text{Binomial}(n, \theta)$, and we wish to make inferences about $\theta$ from a single observation $x$.

So

$$f(x \mid \theta) = \binom{n}{x}\,\theta^{x}(1-\theta)^{n-x}.$$

In this case, suppose we can represent our prior beliefs about $\theta$ by a beta distribution:

$$\theta \sim \text{Beta}(p, q)$$

so that

$$\pi(\theta) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)}\,\theta^{p-1}(1-\theta)^{q-1} \qquad (0 \le \theta \le 1)$$
$$\propto \theta^{p-1}(1-\theta)^{q-1}.$$

The parameters of this distribution are $p > 0$ and $q > 0$. (They are NOT probabilities and may have any positive value.) The mean and variance of this distribution are

$$E(\theta) = m = \frac{p}{p+q} \qquad \text{and} \qquad \operatorname{Var}(\theta) = v = \frac{pq}{(p+q)^2(p+q+1)}.$$

The Beta distribution is written as

$$\pi(\theta) = \frac{\theta^{p-1}(1-\theta)^{q-1}}{B(p,q)} \qquad \text{where} \qquad B(p,q) = \frac{\Gamma(p)\,\Gamma(q)}{\Gamma(p+q)} = \int_0^1 \theta^{p-1}(1-\theta)^{q-1}\,d\theta.$$

We call B(p,q) the beta function; don’t confuse it with the distribution.
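As a quick numerical check (a sketch in base R, with illustrative values p = 3 and q = 2), the built-in beta() function agrees with both the Gamma-function form and the integral definition:

# Check that B(p, q) matches the Gamma-function form and the integral
# definition, for illustrative values p = 3, q = 2.
p <- 3; q <- 2
beta(p, q)                                         # built-in beta function: 1/12
gamma(p) * gamma(q) / gamma(p + q)                 # Gamma-function form
integrate(function(th) th^(p - 1) * (1 - th)^(q - 1), 0, 1)$value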

Some simple cases of the beta distribution are shown in Figure 1.10.

Figure 1.10: Use of the beta distribution to describe a variety of beliefs about a proportion.

Bayes' Rule

Now we apply Bayes' Theorem using this prior distribution:

$$\begin{aligned}
\pi(\theta \mid x) &\propto \pi(\theta)\,f(x \mid \theta) \\
&\propto \theta^{p-1}(1-\theta)^{q-1} \times \theta^{x}(1-\theta)^{n-x} \\
&= \theta^{p+x-1}(1-\theta)^{q+n-x-1} \\
&= \theta^{P-1}(1-\theta)^{Q-1}.
\end{aligned}$$

There is only one density function proportional to this, so it must be the case that

$$\theta \mid x \sim \text{Beta}(P, Q).$$

The updates are

$$\begin{aligned}
P &\leftarrow p + x \\
Q &\leftarrow q + n - x
\end{aligned} \qquad \text{(Updates for a Beta prior with a Binomial likelihood)}$$

In other words, the number of successes is added to the first parameter of the Beta and the number of failures to the second. This does not have to be done all at once; it can be done observation by observation.

Sequential updating of belief in a parameter

Figure 1.11: The diagram shows how the posterior can be calculated sequentially.

Suppose the data consist of a sequence of shots on goal by a player, denoted $Y = 1, 1, 0, 1$. Then our belief in the ability of the player can be updated sequentially.

$$\text{Beta}(1,1) \xrightarrow{\,y=1\,} \text{Beta}(2,1) \xrightarrow{\,y=1\,} \text{Beta}(3,1) \xrightarrow{\,y=0\,} \text{Beta}(3,2) \xrightarrow{\,y=1\,} \text{Beta}(4,2)$$

The expected success rates are

$$\hat{\pi}_0 = \tfrac{1}{2}, \quad \hat{\pi}_1 = \tfrac{2}{3}, \quad \hat{\pi}_2 = \tfrac{3}{4}, \quad \hat{\pi}_3 = \tfrac{3}{5}, \quad \hat{\pi}_4 = \tfrac{4}{6}.$$
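The same sequence can be reproduced with a short R loop (a minimal sketch; the shots y = 1, 1, 0, 1 and the Beta(1, 1) starting prior are those of the example above):

# Sequential Beta updating for the shots-on-goal example.
y <- c(1, 1, 0, 1)
p <- 1; q <- 1                       # Beta(1, 1) prior, mean 1/2
for (yi in y) {
    p <- p + yi                      # a success adds to the first parameter
    q <- q + 1 - yi                  # a failure adds to the second parameter
    cat("Beta(", p, ",", q, ") posterior mean", p / (p + q), "\n")
}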

Sequential inference with the Binomial

Let us take an example of a run of successes and failures in a basketball game. For each success, $p \leftarrow p + 1$; for each failure, $q \leftarrow q + 1$.

y <- c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
    1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0)
# several rows omitted
p <- 1; q <- 1                      # Beta(1, 1) prior
n <- length(y)
mean <- rep(0, n); uq <- rep(0, n); lq <- rep(0, n)
for (i in 1:n) {
    p <- p + y[i]                   # a success adds to the first parameter
    q <- q + 1 - y[i]               # a failure adds to the second parameter
    mean[i] <- p / (p + q)          # posterior mean after i observations
    lq[i] <- qbeta(0.025, p, q)     # 95% credible interval limits
    uq[i] <- qbeta(0.975, p, q)
}

Sequential inference with the Binomial

Figure 1.12: The panel on the left shows the predicted probability calculated sequentially, where all observations are weighted equally with no drift or forgetting. The estimates in the second panel emphasise current form by forgetting observations from the distant past; this model is more responsive to changes.

Forgetting

Bayesian learning and Bayesian forgetting

In stationary models (when $\theta$ is fixed), as more and more observations are made we become more and more certain about the parameter, and intervals for $\theta$ become narrower.

In a model with forgetting, the parameter $\theta$ changes with time. A good example is when $\theta$ represents the current form of a sports team: as the team changes, its form $\theta$ changes. When $\theta$ changes, recent results are judged more relevant than results of games from the distant past.

Click this link for a good video on Bayesian inference on a beta distribution: https://www.coursera.org/learn/bayesian/lecture/xFRKb/inference-on-a-binomial-proportion

1.2.2 Gamma-Poisson conjugacy

Example: A Poisson sample.

Suppose we have a random sample (i.e. independent observations) of size $n$, $x = (x_1, x_2, \ldots, x_n)$, of a random variable $X$ whose distribution is Poisson$(\theta)$. Then

$$f(x \mid \theta) = \prod_{i=1}^{n} \frac{e^{-\theta}\,\theta^{x_i}}{x_i!} = L(\theta; x) \propto e^{-n\theta}\,\theta^{\sum x_i}.$$

As in the binomial example, prior beliefs about $\theta$ will vary from problem to problem, but we'll look for a form which gives a range of different possibilities and is also mathematically tractable.

A conjugate Gamma prior.

In this case we suppose our prior beliefs can be represented by a gamma distribution:

$$\theta \sim \text{Gamma}(p, q),$$

so

$$\pi(\theta) = \frac{q^{p}}{\Gamma(p)}\,\theta^{p-1}\exp\{-q\theta\} \qquad (\theta > 0).$$

The parameter $p > 0$ is a shape parameter, and $q > 0$ is a rate (inverse scale) parameter. The mean and variance of this distribution are

$$E(\theta) = m = \frac{p}{q} \qquad \text{and} \qquad \operatorname{Var}(\theta) = v = \frac{p}{q^{2}}.$$

The Gamma distribution

Figure 1.13: Gamma distributions with the same mean but different values of the parameter $q$. Notice the inverse relationship between the variance and $q$.

Updating a Gamma prior

Assume we have a Poisson likelihood with a Gamma prior. Then, applying Bayes' Theorem with this prior distribution, we get

$$\begin{aligned}
\pi(\theta \mid x) &\propto \frac{q^{p}}{\Gamma(p)}\,\theta^{p-1}\exp\{-q\theta\} \times \exp\{-n\theta\}\,\theta^{\sum x_i} \\
&\propto \theta^{\,p+\sum x_i-1}\exp\{-(q+n)\theta\} \\
&= \theta^{P-1}\exp(-Q\theta).
\end{aligned}$$

Again, there is only one density function proportional to this, so it must be the case that

$$\theta \mid x \sim \text{Gamma}(P, Q),$$

This is another Gamma distribution, whose parameters are modified by the sum of the data, $\sum_{i=1}^{n} x_i$, and the sample size $n$. The updates are

$$\begin{aligned}
P &\leftarrow p + \sum_{i=1}^{n} x_i \\
Q &\leftarrow q + n
\end{aligned} \qquad \text{(Updates for a Gamma prior with a Poisson likelihood)}$$

Sequential inference with the Poisson

Let us take an example of drug arrests recorded by a particular patrol car and update our parameters sequentially: $p \leftarrow p + y$ and $q \leftarrow q + 1$.

y <- c(0, 1, 0, 0, 7, 0, 0, 0, 0, 0, 1, 3, 0, 4, 0, 1, 0, 2, 2, 0, 0, 0,
    0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 2, 1, 0, 1, 0,
    0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 3, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    3, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 1, 0,
    0, 0, 1, 2, 1, 2, 11, 2, 4, 0, 2, 4, 7, 5, 12, 6, 4, 6, 0, 4, 9, 4,
    8, 7, 2, 9, 10, 16, 19, 2)
p <- 1; q <- 1                       # Gamma(1, 1) prior
n <- length(y)
mean <- rep(0, n); uq <- rep(0, n); lq <- rep(0, n)
for (i in 1:n) {
    p <- p + y[i]                    # add the count to the shape parameter
    q <- q + 1                       # add one to the rate for each observation
    mean[i] <- p / q                 # posterior mean after i observations
    lq[i] <- qgamma(0.025, p, q)     # 95% credible interval limits
    uq[i] <- qgamma(0.975, p, q)
}

Sequential inference by updating the Gamma

Figure: The panel on the left shows drug-related reports; the posteriors are calculated sequentially and all observations are weighted equally. Note how the uncertainty in the expected number of drug arrests (the shaded region) decreases over time. In the second panel the past is downweighted and recent history is emphasised.

Football example

Let $i \in \{1, 2, \ldots, 20\}$ denote the home team and $j \in \{1, 2, \ldots, 20\}$ denote the away team, and label the games in chronological order as $t = 1, \ldots, 380$.

$$x_{i,j}^{t} \sim \text{Poisson}(\mu_{i,j}), \qquad \mu_{i,j} = \alpha_i \beta_j \gamma \qquad \text{(Likelihood for goals of the home team)}$$
$$y_{j,i}^{t} \sim \text{Poisson}(\lambda_{j,i}), \qquad \lambda_{j,i} = \alpha_j \beta_i. \qquad \text{(Likelihood for goals of the away team)}$$

where αi denotes the attacking strength of team i, βj is the defensive strength of team j and γ is the common home ground advantage. The priors for the Poisson model are given below. δ is fixed at say δ=10.

$$\alpha_i \sim \text{Gamma}(\delta, \delta) \qquad \text{(Priors for attacking strengths)}$$
$$\beta_j \sim \text{Gamma}(\delta, \delta) \qquad \text{(Priors for defensive strengths)}$$
$$\gamma \sim \text{Gamma}(\delta, \delta) \qquad \text{(Prior for home ground advantage)}$$

Football example

Let us say Liverpool is playing Arsenal at Home. The prior attacking and defensive strengths before the game are

$$\alpha_L \sim \text{Gamma}(2,1), \qquad \alpha_A \sim \text{Gamma}(3,2),$$
$$\beta_L \sim \text{Gamma}(1,3), \qquad \beta_A \sim \text{Gamma}(3,4),$$
$$\gamma \sim \text{Gamma}(3,2).$$

Let us say Liverpool win 4-1.

  1. Find the expected attacking strengths, defensive strengths and HGA before the game.

  2. Find the prior expected score (see the sketch after this list).

  3. Write out the likelihood for the scores of the home and away teams.

  4. Find the posterior distributions of all five parameters.

  5. Update the Gamma parameters for the attacking and defensive strengths, and the HGA, after the game.

  6. Which teams improved their attacking and defensive strengths?
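As a starting point for the first two parts only, a minimal R sketch (not a full solution) computes the prior means from $E(\theta) = p/q$ and the prior expected scores, assuming the Gamma priors are independent so that the expectation of a product is the product of expectations:

# Prior means of the five parameters, E(theta) = p/q for a Gamma(p, q) prior.
alpha_L <- 2 / 1     # Liverpool attacking strength
alpha_A <- 3 / 2     # Arsenal attacking strength
beta_L  <- 1 / 3     # Liverpool defensive strength
beta_A  <- 3 / 4     # Arsenal defensive strength
gamma_H <- 3 / 2     # home ground advantage

# Prior expected scores: E(mu_LA) = E(alpha_L) E(beta_A) E(gamma) and
# E(lambda_AL) = E(alpha_A) E(beta_L), by independence of the priors.
c(home = alpha_L * beta_A * gamma_H,   # 2.25 expected goals for Liverpool
  away = alpha_A * beta_L)             # 0.50 expected goals for Arsenal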

Attacking strength 2017-18

Figure: Attacking strength shown through the 2017-18 football league season. All goals are modelled as Poisson. All teams start with the same prior (i.e. the same ability). The shaded region shows the region between the upper and lower quartiles.

Defensive strength 2017-18

Figure 1.14: Defensive strength shown through the 2017-18 Premier League season.

1.2.3 Gaussian-Gaussian conjugacy

Inference for the mean of normally distributed data

Let $y = (y_1, y_2, \ldots, y_n)$ be a random sample of size $n$ of a random variable $Y$ with the Normal$(\mu, \tfrac{1}{\tau})$ distribution, where the precision $\tau = \tfrac{1}{\sigma^2}$ is assumed known. The likelihood is better described using the precision $\tau$.

$$\begin{aligned}
f(y \mid \mu, \sigma) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\} \\
&= \frac{\tau^{1/2}}{\sqrt{2\pi}}\exp\left\{-\frac{\tau}{2}(y-\mu)^2\right\} \\
&\propto \tau^{1/2}\exp\left\{-\frac{\tau}{2}(y-\mu)^2\right\}.
\end{aligned}$$

Useful result

The quadratic form of the Gaussian

The following is a useful result for identifying normal distributions. If $\mu$ is a parameter with probability density function $f(\mu)$, then

$$f(\mu) \propto \exp\left(-\tfrac{1}{2}\{A\mu^2 - 2B\mu\}\right)$$

if and only if $\mu \sim \text{Normal}(B/A,\ 1/A)$.
Proof.

$$\begin{aligned}
\mu \sim \text{Normal}(B/A,\ 1/A)
&\iff -2\log f(\mu) = A\left(\mu - \frac{B}{A}\right)^2 + \text{const} \\
&\iff -2\log f(\mu) = A\mu^2 - 2B\mu + \frac{B^2}{A} + \text{const} \\
&\iff f(\mu) \propto \exp\left(-\tfrac{1}{2}\{A\mu^2 - 2B\mu\}\right).
\end{aligned}$$

Gaussian likelihood and prior

We now pair up the likelihood with a prior for μ.

$$Y_i \sim \text{Normal}\left(\mu, \tfrac{1}{\tau}\right), \quad i = 1, 2, \ldots, n \qquad \text{(The likelihood)}$$
$$\mu \sim \text{Normal}\left(\mu_0, \tfrac{1}{\tau_0}\right) \qquad \text{(The prior)}$$

We now show, using the result above, that this Normal prior is conjugate for the mean of the Normal likelihood.

Gaussian likelihood and prior

$$\begin{aligned}
\pi(\mu \mid \mathbf{y}) &\propto L(\mu; y)\,\pi(\mu) \\
&\propto \exp\left(-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right)\exp\left(-\frac{\tau_0}{2}(\mu - \mu_0)^2\right) \\
&\propto \exp\left(-\frac{\tau}{2}\left(n\mu^2 - 2\mu\sum_{i=1}^{n}y_i\right)\right)\exp\left(-\frac{\tau_0}{2}\left(\mu^2 - 2\mu\mu_0\right)\right) \\
&\propto \exp\left(-\frac{1}{2}\left\{(n\tau + \tau_0)\mu^2 - 2\mu\left(n\tau\bar{y} + \tau_0\mu_0\right)\right\}\right) \\
&\propto \exp\left(-\frac{n\tau + \tau_0}{2}\left(\mu - \frac{n\tau\bar{y} + \tau_0\mu_0}{n\tau + \tau_0}\right)^2\right)
\end{aligned}$$

$$\mu \mid y, \tau \sim \text{Normal}\left(\frac{n\tau\bar{y} + \tau_0\mu_0}{n\tau + \tau_0},\ \frac{1}{n\tau + \tau_0}\right) \qquad (1.1)$$

Bayes updating of μ, the mean

  1. Prior precision = $\tau_0$.

  2. Sample precision = $n\tau$.

  3. Posterior precision: $\tau_p \leftarrow \tau_0 + n\tau$.

  4. Posterior precision = prior precision + sample precision.

  5. Prior mean = $\mu_0$.

  6. Sample mean = $\bar{y}$.

  7. Posterior mean: $\mu_p \leftarrow \dfrac{n\tau\bar{y} + \tau_0\mu_0}{n\tau + \tau_0} = \gamma_0\mu_0 + \gamma_s\bar{y}$.

  8. $\gamma_0 = \dfrac{\tau_0}{n\tau + \tau_0} = \dfrac{\tau_0/\tau}{n + \tau_0/\tau}$.

  9. $\gamma_s = \dfrac{n\tau}{n\tau + \tau_0} = \dfrac{n}{n + \tau_0/\tau}$.

  10. Posterior mean = weighted sum of the prior mean and the sample mean (verified numerically in the sketch after this list).

  11. The effective sample size (ESS) of a Gaussian prior with respect to a Gaussian sample is $\tau_0/\tau$.
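These relationships are easy to verify numerically. The R sketch below uses made-up values for $\mu_0$, $\tau_0$ and $\tau$ together with simulated data:

# Posterior precision and mean for a Normal mean with known precision tau.
set.seed(1)
mu0 <- 0; tau0 <- 0.5                 # prior mean and precision (illustrative)
tau <- 2                              # known precision of each observation
y <- rnorm(25, mean = 1, sd = 1 / sqrt(tau))
n <- length(y); ybar <- mean(y)
tau_p <- tau0 + n * tau                           # posterior precision
mu_p <- (n * tau * ybar + tau0 * mu0) / tau_p     # posterior mean
g0 <- tau0 / (n * tau + tau0)                     # weight on the prior mean
gs <- n * tau / (n * tau + tau0)                  # weight on the sample mean
c(mu_p, g0 * mu0 + gs * ybar)                     # the two forms agree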

Observations to be made

A number of observations can be made:

  1. Note that the effective sample size is $n_0 = \tau_0/\tau$, or $\tau_0 = n_0\tau$.

  2. Observe that 'posterior precision' = 'prior precision' $+\ n \times$ 'precision of each data item'.

  3. As $n \to \infty$, then (loosely)

     $$\mu \mid y \sim \text{Normal}\left(\bar{y}, \frac{\sigma^2}{n}\right),$$

     so that the prior has no effect in the limit.

  4. As the uncertainty contained in the prior increases ($\sigma_0^2 \to \infty$), or equivalently the prior precision decreases ($\tau_0 \to 0$), we again obtain

     $$\mu \mid y \sim \text{Normal}\left(\bar{y}, \frac{\sigma^2}{n}\right).$$

  5. Note that the posterior distribution depends on the data only through $\bar{y}$ and not through the individual values of the $y_i$ themselves. Again, we say that $\bar{y}$ is sufficient for $\mu$.

Sequential inference when updating a mean

Let us take an example data set: yearly suicides in Australia per 100,000 individuals. We update the mean one observation at a time. Bayes' theorem is applied at each time step $i = 1, 2, \ldots$, using the previous mean, $\mu_{i-1}$, in the prior for the current mean $\mu_i$.

$$\mu_i \sim \text{Normal}\left(\mu_{i-1}, \frac{1}{\tau_{i-1}}\right) \qquad \text{(The prior)}$$
$$y_i \sim \text{Normal}\left(\mu_i, \frac{1}{\tau}\right) \qquad \text{(The likelihood)}$$
$$\mu_i \mid y_i \sim \text{Normal}\left(\frac{\mu_{i-1}\tau_{i-1} + y_i\tau}{\tau_{i-1} + \tau},\ \frac{1}{\tau_{i-1} + \tau}\right) \qquad \text{(The posterior)}$$
$$\phantom{\mu_i \mid y_i} \sim \text{Normal}\Bigl(\mu_{i-1} + \underbrace{\bigl(\tfrac{\tau}{\tau + \tau_{i-1}}\bigr)}_{\text{Gain}}\underbrace{(y_i - \mu_{i-1})}_{\text{error}},\ \tfrac{1}{\tau + \tau_{i-1}}\Bigr)$$
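A sketch of this recursion in R, in the same style as the earlier binomial and Poisson loops (the series below is simulated as a stand-in for the yearly rates, and the observation precision $\tau$ is assumed known):

# Sequential updating of a Normal mean with known observation precision tau.
set.seed(2)
y <- rnorm(100, mean = 12, sd = 1)    # simulated stand-in for the yearly rates
tau <- 1                              # assumed known observation precision
mu <- 10; tau_mu <- 0.1               # initial prior Normal(mu, 1/tau_mu)
post_mean <- rep(0, length(y))
for (i in seq_along(y)) {
    gain <- tau / (tau + tau_mu)      # weight given to the new observation
    mu <- mu + gain * (y[i] - mu)     # posterior mean = prior mean + gain * error
    tau_mu <- tau_mu + tau            # posterior precision accumulates
    post_mean[i] <- mu
}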

1.2.4 Gaussian-Gamma conjugacy

The normal distribution, mean assumed known

Let

$$Y \mid \tau \sim \text{Normal}\left(\mu, \frac{1}{\tau}\right)$$
$$f(y \mid \tau) \propto \tau^{n/2}\,e^{-\sum_{i=1}^{n}(y_i - \mu)^2\,\tau/2}$$

Since $\mu$ is fixed, $S = \sum_{i=1}^{n}(y_i - \mu)^2$ is a known quantity, and the likelihood as a function of $\tau$ has a Gamma-like form. We therefore take the prior $\tau \sim \text{Gamma}(p, q)$.

$$\pi(\tau) \propto \tau^{p-1}e^{-q\tau}. \qquad \text{(The prior)}$$
$$f(y \mid \tau) \propto \tau^{n/2}e^{-S\tau/2}. \qquad \text{(The likelihood)}$$
$$\begin{aligned}
\pi(\tau \mid y, \mu) &\propto \pi(\tau)\,f(y \mid \tau) \\
&\propto \tau^{\,p + n/2 - 1}\,e^{-\tau(S/2 + q)}
\end{aligned}$$
$$\tau \mid y, \mu \sim \text{Gamma}\left(\frac{n}{2} + p,\ \frac{\sum_{i=1}^{n}(y_i - \mu)^2}{2} + q\right). \qquad \text{(The posterior)}$$

The normal distribution (mean, μ, known)

So the updates for the parameters of the Gamma distribution are

$$\begin{aligned}
P &\leftarrow p + \frac{n}{2} \\
Q &\leftarrow q + \frac{\sum_{i=1}^{n}(y_i - \mu)^2}{2}
\end{aligned} \qquad \text{(Updates for a Gamma prior with a Normal likelihood)}$$
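A minimal R sketch of these updates, with $\mu$ assumed known and simulated data standing in for a real series:

# Updating a Gamma(p, q) prior on the precision tau of Normal data, mean known.
set.seed(3)
mu <- 0; true_tau <- 4
y <- rnorm(50, mean = mu, sd = 1 / sqrt(true_tau))
p <- 1; q <- 1                        # Gamma(1, 1) prior on tau
P <- p + length(y) / 2                # shape update
Q <- q + sum((y - mu)^2) / 2          # rate update
P / Q                                 # posterior mean of tau, close to true_tau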

1.2.5 Gamma-Laplacian conjugacy

The Laplacian distribution (mean, μ, known)

The Laplacian distribution is useful for modeling distributions with heavy tails such as those found in stock-market returns. We model the returns as

$$y_i \mid \mu, \tau \sim \text{Laplace}\left(\mu, \frac{1}{\tau}\right)$$

which has density given by

$$f(y_i \mid \tau, \mu) = \frac{\tau}{2}\exp\left(-\tau|y_i - \mu|\right), \qquad i = 1, 2, \ldots, n,$$

where $\tau$ is a rate parameter, not the precision. The Laplacian has variance $\operatorname{Var}(y_i \mid \mu, \tau) = 2\tau^{-2}$. In volatility modelling we are interested in how the variance of the observations changes (not the mean), so we set the mean of the Laplacian to zero. Then

$$f(y \mid \tau) \propto \tau^{n}\,e^{-\tau\sum_{i=1}^{n}|y_i|}.$$

When the prior is $\tau \sim \text{Gamma}(p, q)$, the posterior becomes

$$\pi(\tau \mid y) \propto \pi(\tau)\,f(y \mid \tau)$$
$$\tau \mid y \sim \text{Gamma}\left(n + p,\ \sum_{i=1}^{n}|y_i| + q\right).$$

The Laplacian distribution

So the updates for the Gamma distribution are

$$\begin{aligned}
P &\leftarrow p + n \\
Q &\leftarrow q + \sum_{i=1}^{n}|y_i|
\end{aligned} \qquad \text{(Updates for a Gamma prior with a Laplacian likelihood)}$$
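A sketch in R of this update applied sequentially, in the spirit of the volatility estimates in Figure 1.16 (the returns are simulated rather than the real index data, using the fact that a Laplace variate with rate τ can be generated as the difference of two independent Exponential(τ) variates):

# Sequential Gamma updating for the rate tau of zero-mean Laplace returns.
set.seed(4)
true_tau <- 10
y <- rexp(250, rate = true_tau) - rexp(250, rate = true_tau)   # simulated returns
p <- 1; q <- 1                        # Gamma(1, 1) prior on tau
sd_est <- rep(0, length(y))
for (i in seq_along(y)) {
    p <- p + 1                        # each observation adds 1 to the shape
    q <- q + abs(y[i])                # and |y_i| to the rate
    sd_est[i] <- sqrt(2) / (p / q)    # plug-in estimate of sd = sqrt(2)/tau
}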

Two stockmarket crashes

The 2008 and 2010 stock market crashes and how they unfolded

Lehman Brothers filed for bankruptcy on 15 September 2008, prompting a fall in the FTSE 100 of 4%. It was the beginning of a slump that by Christmas of that year had resulted in 23.4% being wiped off the value of Britain’s top 100 companies.

In a matter of minutes (May 2010) the Dow Jones index lost almost 9% of its value in a sequence of events that quickly became known as the "flash crash". Hundreds of billions of dollars were wiped off the share prices of household name companies. But the carnage, which took place at a speed never before witnessed, did not last long. The market rapidly regained its composure and eventually closed 3% lower. In just 20 minutes, 2 Bn shares worth $56 Bn had changed hands.

The dataset

A dataset that illustrates these shocks is shown here

ISE100 Istanbul stock exchange national 100 index
SP Standard & Poor's 500 return index
DAX Stock market return index of Germany
FTSE Stock market return index of UK
NIK Stock market return index of Japan
BVSP Stock market return index of Brazil
EU MSCI European index
EM MSCI emerging markets index
Figure 1.15: Returns from the worldwide stock market data.

Sequential estimates of the volatility

Figure 1.16: Sequential estimate of the current standard deviation $\hat{\sigma}_i = \sqrt{2}/\tau_i$ for each stock.

Gamma or exponential likelihood with a gamma prior

Let $X_1, \ldots, X_n$ be independent variables having the Gamma$(k, \theta)$ distribution, where $k$ is known. Then

$$L(\theta; x) \propto \theta^{nk}\exp\{-\theta\textstyle\sum x_i\}.$$

Now, studying this form, regarded as a function of $\theta$, suggests we could take a prior of the form

$$\pi(\theta) \propto \theta^{p-1}\exp\{-q\theta\},$$

that is, $\theta \sim \text{Gamma}(p, q)$, since then, by Bayes' Theorem,

$$\pi(\theta \mid x) \propto \theta^{\,p + nk - 1}\exp\{-(q + \textstyle\sum x_i)\theta\},$$

and so $\theta \mid x \sim \text{Gamma}(p + nk,\ q + \sum x_i)$. So the updates for the Gamma distribution are

$$\begin{aligned}
P &\leftarrow p + nk \\
Q &\leftarrow q + \sum x_i
\end{aligned} \qquad \text{(Updates for a Gamma prior with a Gamma likelihood)}$$
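A minimal R sketch of this update, with $k$ known and simulated Gamma data:

# Updating a Gamma(p, q) prior for the rate theta of Gamma(k, theta) data, k known.
set.seed(5)
k <- 3; true_theta <- 2
x <- rgamma(40, shape = k, rate = true_theta)
p <- 1; q <- 1                        # Gamma(1, 1) prior on theta
P <- p + length(x) * k                # shape update
Q <- q + sum(x)                       # rate update
P / Q                                 # posterior mean of theta, near true_theta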