3 Multi-Parameter Likelihoods

Examples of multi-parameter likelihoods

In this section we look at a number of statistical models having two parameters, and try to understand what can be learnt from looking at contour maps of the likelihood surface and how to calculate the MLEs. In principle the same methodology applies in higher dimensions, but exploring graphically in more than two dimensions becomes problematic. In each example it is assumed that $X_1,\dots,X_n$ are independent random variables.

Example 3.5:  Normal Data, ctd.

Let $X_i\sim N(\mu,\sigma^{2})$, so that $\theta=(\mu,\sigma)$ with parameter space $\Omega=(-\infty,\infty)\times(0,\infty)$. Obtain the likelihood function for $\theta$ and the MLE $\hat\theta$.

Then,
\[
L(\theta)=\prod_{i=1}^{n}\frac{1}{(2\pi)^{1/2}\sigma}\exp\left\{-\frac{1}{2\sigma^{2}}(x_i-\mu)^{2}\right\},
\]
and so
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n}\left\{-\frac{1}{2}\log(2\pi)-\log\sigma-\frac{1}{2\sigma^{2}}(x_i-\mu)^{2}\right\}\\
&= -\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}.
\end{align*}

First we calculate the MLEs; to do this we differentiate with respect to μ and σ and obtain:

\begin{align*}
\frac{\partial \ell}{\partial \mu}(\theta) &= \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)\\
\frac{\partial \ell}{\partial \sigma}(\theta) &= -\frac{n}{\sigma}+\frac{1}{\sigma^{3}}\sum_{i=1}^{n}(x_i-\mu)^{2}.
\end{align*}

For a maximum we need to obtain $\theta=\hat\theta$ such that
\[
\frac{\partial \ell}{\partial \mu}(\hat\theta)=0 \quad\text{and}\quad \frac{\partial \ell}{\partial \sigma}(\hat\theta)=0.
\]

Thus,
\[
\frac{1}{\hat\sigma^{2}}\sum_{i=1}^{n}(x_i-\hat\mu)=0,
\]
and so $\hat\mu=\bar{x}$, and
\[
-\frac{n}{\hat\sigma}+\frac{1}{\hat\sigma^{3}}\sum_{i=1}^{n}(x_i-\hat\mu)^{2}=0,
\]
giving $\hat\sigma=\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{1/2}$.

Thus $\hat\mu$ is the sample mean and $\hat\sigma$ is the sample standard deviation (with divisor $n$ rather than $n-1$). This seems reasonable, given that $\mu$ and $\sigma$ are the population mean and standard deviation respectively.
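As a quick check, the following short R sketch (not part of the original notes) computes these closed-form estimates for an illustrative simulated sample; note that R's built-in sd() uses divisor n-1, whereas the MLE uses divisor n.

# illustrative data; any numeric vector x could be used here
x<-rnorm(25,mean=0,sd=1)
n<-length(x)
mu.hat<-mean(x)
sigma.hat<-sqrt(mean((x-mu.hat)^2))   # equivalently sqrt((n-1)/n)*sd(x)
c(mu.hat,sigma.hat)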

We will now look at this example in a number of ways to try to understand how the likelihood is determined by the sample data.

Figure 3.2: Link, Caption: Histograms of four simulated data sets from the N(μ,σ2) distribution, with μ=0,σ=1, together with contour plots of associated log-likelihood function of θ=(μ,σ).

Sample similarities and variation. Figure 3.2 (Link) shows histograms of 4 data sets, each of size 25, simulated from the N(0,1) distribution. Thus, the true value of θ is (0,1). For each sample, contours of the log-likelihood function for θ have been plotted in steps of one unit down from the maximum, which is indicated by an asterisk.

There are similarities in all four cases:

  • The shape of the contours is very similar in each case: reasonably elliptical, with principal axes parallel to the coordinate axes.

  • There is some asymmetry in the σ direction, which becomes more pronounced as we move further from the maximum.

  • The true parameter value has reasonably high likelihood and the MLE seems to be a quite adequate estimate of the true value.

We can also see how sample characteristics affect the likelihood. For example:

  • The top-right sample seems to be rather more dispersed than the others; consequently, the likelihood gives greater support to larger values of σ than in the other examples.

  • The bottom-left sample seems slightly shifted to the left (of zero); thus, the likelihood is greater for negative rather than positive values of μ.

Although all of the samples are of equal size and drawn from the same population, variation across the samples leads to variation in some characteristics of the likelihood surface (the location of the maximum, the gradient of the contours), while other features (the overall shape) remain roughly the same.

R code to produce Figure 3.2 is given at the end of the example.

Figure 3.3: First Link, Second Link, Caption: Simulated data set from the N(μ,σ2) distribution, with μ=10,σ=2. True probability density function is plotted in bold on left; other curves on left represent normal densities based on different estimates of μ and σ. Right: log-likelihood contour plot.

Densities corresponding to parameter pairs. Now consider another sample of size n=25 simulated from the Normal distribution, but with θ=(10,2). A histogram of the data is shown in Figure 3.3 (Link), with the probability density function of the $N(10,2^{2})$ distribution from which they were simulated superimposed in bold.

The right panel of Figure 3.3 (Link) plots contours of the log-likelihood function at steps of 1 unit down from the maximum.

Also identified on the contour plot are the maximum likelihood estimate (labelled ‘1’) and three other points (labelled ‘2’ to ‘4’). The probability density curves corresponding to these values have been superimposed on the histogram.

The curves representing the data have the following features:

  • Curve ‘1’ is the most consistent with the sample data, and is a quite reasonable estimate of the true model.

  • Curve ‘4’ is not much worse, as might be expected since the likelihood is not much below that of the maximum.

  • Curves ‘2’ and ‘3’ still have reasonably high likelihood, but we can see from the histogram why they have less support from the sample: curve ‘2’ has the mean about right, but its variance is greater than that of the data;

  • Curve ‘3’ seems to have the mean too low.

All of these conclusions are consistent with the information provided by the likelihood contours.

Figure 3.5: Link, Caption: Histograms of simulated data sets of sizes n=10, 25, 100 respectively from the N(μ,σ2) distribution, with μ=0,σ=1, together with contour plots of the associated log-likelihood function of θ=(μ,σ).

Effect of sample size, n. Figure 3.5 (Link) shows simulated datasets from the N(0,1) distribution, and corresponding log-likelihood contour plots for samples of sizes n=10,25,100. The contour plots are drawn on the same scales.

The main point to notice is that as the sample size increases, the log-likelihood surface becomes steeper: the contours are more closely spaced around the maximum. Thus, with more data, smaller regions around the maximum likelihood estimate $\hat\theta$ are found to be ‘plausible’ values of θ, as judged in terms of likelihood.

This is what we would expect: with more data our inferences on θ should be more precise.

R Code for Figure 3.2 (Link)

# produce 2x4 graphics display
par(mfrow=c(2,4))

# function to calc. l(theta) for normal data
lliknorm<-function(x,mu,sigma){
     # sample size
     n<-length(x)

     # log-lhd equation (with additive consts removed)
     -n*log(sigma) - 1/(2*sigma^2) * sum((x-mu)^2)
} # end of function

# optional: will mean you get the same results as me:
set.seed(330)

#for loop to produce 4 variants
for(i in 1:4){
     # generate the data (25 normal deviates)
     x<-rnorm(25,mean=0,sd=1)

     # histogram of the data (breakpoints in hist forced)
     hist(x,xlim=c(-3,3),breaks=(-3):3,main=NULL)

     # grid of mu/sigma values to evaluate
     mu<-seq(from=-1.5,to=1.5,length=100)
     sigma<-seq(from=0.5,to=2.5,length=100)
     # evaluate llik for each (mu,sigma) pair
     llmat<-matrix(nrow=100,ncol=100)
     for(j in 1:100) for(k in 1:100){
          llmat[j,k]<-lliknorm(x,mu[j],sigma[k])
     }

     # maximised value of the log-likelihood (note: this is not the MLE itself)
     maxll<-max(llmat)
     # where to draw the contours
     levelsdrawn<-maxll-0:4

     # contour plot of log-lhd
     contour(mu,sigma,llmat,drawlabels=FALSE,levels=levelsdrawn,
             ylab="sigma",xlab="mu")
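     # (added sketch, not in the original code) mark the grid point with the
     # highest log-likelihood by an asterisk, as indicated in Figure 3.2
     best<-which(llmat==max(llmat),arr.ind=TRUE)[1,]
     points(mu[best[1]],sigma[best[2]],pch="*",cex=2)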
} # end of for loop

Recall that in MATH235 we studied the one-parameter Uniform[0,θ] distribution and found that although we could use likelihood techniques, the standard asymptotic results and techniques did not apply because the support of the distribution depends on the parameter, so it does not satisfy regularity condition R1. This is also true in the two-parameter case — the problem is non-regular.

Example 3.6:  Uniform Data.
Suppose $X_i\sim\text{Uniform}[a,b]$. Hence, $\theta=(a,b)$ with $\Omega=\{(a,b):-\infty<a<b<\infty\}$, and
\[
f(x\mid\theta)=\begin{cases}(b-a)^{-1} & \text{for } a\le x\le b\\[2pt] 0 & \text{otherwise.}\end{cases}
\]
So,
\[
L(\theta)=\begin{cases}(b-a)^{-n} & \text{if } a\le x_i\le b \text{ for all } i=1,\dots,n\\[2pt] 0 & \text{otherwise}\end{cases}
\]
and
\[
\ell(\theta)=\begin{cases}-n\log(b-a) & \text{if } a\le\min(x_1,\dots,x_n) \text{ and } \max(x_1,\dots,x_n)\le b\\[2pt] -\infty & \text{otherwise.}\end{cases}
\]

This is a model for which the log-likelihood is not differentiable at the MLE. However it still helps to consider the partial derivatives for $a\le\min(x_1,\dots,x_n)$ and $\max(x_1,\dots,x_n)\le b$:

\begin{align*}
\frac{\partial \ell}{\partial a}(\theta) &= n/(b-a)\\
\frac{\partial \ell}{\partial b}(\theta) &= -n/(b-a).
\end{align*}

Since (b-a)>0, these tell us that the log-likelihood is increasing as a increases and decreasing as b increases.

As the log-likelihood is $-\infty$ outside the region we are considering, we wish to choose $\hat a$ as large as possible and $\hat b$ as small as possible within this region.

This gives $\hat a=\min(x_1,\dots,x_n)$ and $\hat b=\max(x_1,\dots,x_n)$.
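The MLEs here need no numerical work: they are just the sample extremes. A minimal R sketch (not from the notes), using an illustrative sample like that of Figure 3.6:

# illustrative data from Uniform[3,6], as in Figure 3.6
x<-runif(25,min=3,max=6)
a.hat<-min(x)
b.hat<-max(x)
c(a.hat,b.hat)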

Figure 3.6: Link, Caption: Histogram of simulated data set of size n=25 from the Uniform[a,b] distribution, with a=3,b=6, together with contour plot of associated log-likelihood function of θ=(a,b).

Figure 3.6 (Link) shows the histogram of a sample of size 25 simulated from the Uniform[3,6] distribution and a contour plot of the associated log-likelihood function; as usual, contours are plotted in steps of -1 from the maximum.

It is also clear from the figure that $L$ (or $\ell$) will be maximized by taking $(b-a)$ as small as possible, subject to the constraint that each of the $x_i$ lies in the interval $[a,b]$. This maximizing point is plotted in Figure 3.6 (Link).

The contours are highly non-elliptical: they are straight lines along which $b-a$ is constant, i.e. parallel to the line $b=a$. (Note, however, that the contours are truncated along the lines $a=\min\{x_1,\dots,x_n\}$ and $b=\max\{x_1,\dots,x_n\}$, which again correspond to the limits of the parameter space having observed the data.)

Example 3.7:  Gamma Distribution.

Let $X_i\sim\text{Gamma}(\alpha,\beta)$, so $\theta=(\alpha,\beta)$ and $\Omega=(0,\infty)\times(0,\infty)$. Then

\[
L(\theta)=\prod_{i=1}^{n}\left\{\frac{\beta^{\alpha}x_i^{\alpha-1}\exp\{-\beta x_i\}}{\Gamma(\alpha)}\right\},
\]
where $\Gamma(\alpha)=\int_{0}^{\infty}s^{\alpha-1}\exp\{-s\}\,ds$ is the gamma function. It follows that
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n}\left\{\alpha\log\beta+(\alpha-1)\log x_i-\beta x_i-\log\Gamma(\alpha)\right\}\\
&= n\alpha\log\beta+(\alpha-1)\sum_{i=1}^{n}\log x_i-\beta\sum_{i=1}^{n}x_i-n\log\Gamma(\alpha).
\end{align*}

Thus,
\[
\frac{\partial \ell}{\partial \alpha}(\theta)=n\log\beta+\sum_{i=1}^{n}\log x_i-n\gamma(\alpha).
\]
Note that $\gamma(\alpha)=\frac{\partial}{\partial\alpha}\log\Gamma(\alpha)=\Gamma'(\alpha)/\Gamma(\alpha)$ (the digamma function) may look complicated, but you can think of it simply as a function of $\alpha$. Also
\[
\frac{\partial \ell}{\partial \beta}(\theta)=\frac{n\alpha}{\beta}-\sum_{i=1}^{n}x_i=\frac{n\alpha}{\beta}-n\bar{x},
\]

so to obtain $\hat\theta$ we need to solve
\[
\frac{\partial \ell}{\partial \alpha}(\hat\theta)=0 \quad\text{and}\quad \frac{\partial \ell}{\partial \beta}(\hat\theta)=0.
\]

Hence,
\[
\hat\beta=\hat\alpha/\bar{x},
\]

and
\[
n\log\hat\beta+\sum_{i=1}^{n}\log x_i-n\gamma(\hat\alpha)=0,
\]
so replacing $\hat\beta$ in terms of $\hat\alpha$ gives
\[
n\log\hat\alpha-n\log\bar{x}+\sum_{i=1}^{n}\log x_i-n\gamma(\hat\alpha)=0.
\]

This latter equation can only be solved numerically.
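A short R sketch of this numerical solution (not part of the original notes): the score equation in $\hat\alpha$ is solved with uniroot(), using R's digamma() for $\gamma(\alpha)$, and $\hat\beta$ is then recovered as $\hat\alpha/\bar{x}$. The search interval and the simulated data are illustrative choices.

# illustrative data, as in Figure 3.7
x<-rgamma(25,shape=3,rate=1)
n<-length(x)

# left-hand side of: n*log(alpha)-n*log(xbar)+sum(log(x))-n*digamma(alpha)=0
score.alpha<-function(alpha){
     n*log(alpha)-n*log(mean(x))+sum(log(x))-n*digamma(alpha)
}

# solve numerically for alpha-hat, then beta-hat = alpha-hat/xbar
alpha.hat<-uniroot(score.alpha,interval=c(0.01,100))$root
beta.hat<-alpha.hat/mean(x)
c(alpha.hat,beta.hat)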

With true parameter values of α=3 and β=1, Figure 3.7 (Link) shows a simulated sample of size n=25 and the associated log-likelihood contour plot.

In this case the maximum likelihood estimate of θ appears to be a good estimate of the true value, which itself lies inside the highest contour. The contours are near-elliptical, but the principal axes are not parallel to the coordinate axes.

Figure 3.7: Link, Caption: Histogram of simulated data set of size n=25 from the Gamma(α,β) distribution, with α=3,β=1, together with contour plots of associated log-likelihood function of θ=(α,β).

This has implications for inference, suggesting that estimated values of $\alpha$ quite different from $\hat\alpha$ are plausible, provided the estimate of $\beta$ changes accordingly.

This is understandable if we think about the effect of the parameters α and β: α is a shape parameter, while β is a scale parameter. Quite similar models can be obtained by, say, increasing the scale parameter (and thus changing the dispersion of the data), provided the shape of the distribution is also distorted somewhat.

This aspect can be seen quite clearly from Figure 3.8 (Link). This is a similar exercise to that carried out for the normal distribution, but now based on the simulated Gamma data. The left panel shows a histogram of the data, with the true Gamma density shown in bold.

The right panel plots the contours of the log-likelihood function, with three points marked. Point ‘1’ is the maximum likelihood estimate; point ‘2’ is on the edge of the first likelihood contour; point ‘3’ is closer to the maximum likelihood estimate than point ‘2’, but has a considerably lower likelihood.

Figure 3.8: First Link, Second Link, Caption: Simulated data set from the Gamma(α,β) distribution, with α=3,β=1. True probability density function is plotted in bold on left; other curves on left represent densities based on different estimates of α and β. Right: log-likelihood contour plot.

The corresponding model densities are also super-imposed on the left panel. We see that

  • Curves ‘1’ and ‘2’ are actually quite similar, as a consequence of the compensation between the parameters α and β;

  • However, the relatively small change from point ‘1’ to point ‘3’ has led to a substantially different curve, which seems to represent the data less well. This is a consequence of the fact that α and β have not moved in a way that allows their effects to compensate each other.

Note, in fact, that since the Gamma(α,β) distribution has expectation

\[
E(X)=\frac{\alpha}{\beta}=3,
\]

it follows that all points on the line β=α/3 have constant mean. Thus, it might be anticipated that the contours should be almost parallel to this line.

Example 3.8:  Simple Linear Regression.
We now look at a simple regression model: $X_i\sim N(\alpha+\beta z_i,1)$ with (known) explanatory variables $(z_1,\dots,z_n)$. Thus, $\theta=(\alpha,\beta)$ and $\Omega=(-\infty,\infty)\times(-\infty,\infty)$.

The likelihood is

\[
L(\theta)=\prod_{i=1}^{n}\left\{\frac{1}{(2\pi)^{1/2}}\exp\left\{-\frac{1}{2}(x_i-\alpha-\beta z_i)^{2}\right\}\right\}
\]
and
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n}\left\{-\frac{1}{2}\log(2\pi)-\frac{1}{2}(x_i-\alpha-\beta z_i)^{2}\right\}\\
&= -\frac{n}{2}\log(2\pi)-\frac{1}{2}\sum_{i=1}^{n}(x_i-\alpha-\beta z_i)^{2}.
\end{align*}

To obtain θ^:

\begin{align*}
\frac{\partial \ell}{\partial \alpha}(\theta) &= \sum_{i=1}^{n}(x_i-\alpha-\beta z_i)\\
\frac{\partial \ell}{\partial \beta}(\theta) &= \sum_{i=1}^{n}(x_i-\alpha-\beta z_i)z_i.
\end{align*}

For a maximum we find $\theta=\hat\theta$ such that
\[
\frac{\partial \ell}{\partial \alpha}(\hat\theta)=0 \quad\text{and}\quad \frac{\partial \ell}{\partial \beta}(\hat\theta)=0.
\]

Thus from the first equation we have
\begin{align*}
0 &= \sum_{i=1}^{n}(x_i-\hat\alpha-\hat\beta z_i)\\
&= n\bar{x}-n\hat\alpha-\hat\beta n\bar{z}\\
\Rightarrow\quad \hat\alpha &= \bar{x}-\hat\beta\bar{z}.
\end{align*}
From the second equation we have
\begin{align*}
0 &= \sum_{i=1}^{n}(x_i-\hat\alpha-\hat\beta z_i)z_i\\
&= \sum_{i=1}^{n}x_i z_i-\hat\alpha\sum_{i=1}^{n}z_i-\hat\beta\sum_{i=1}^{n}z_i^{2}\\
&= \sum_{i=1}^{n}x_i z_i-(\bar{x}-\hat\beta\bar{z})n\bar{z}-\hat\beta\sum_{i=1}^{n}z_i^{2}\\
\Rightarrow\quad \hat\beta &= \frac{\sum_{i=1}^{n}x_i z_i-n\bar{x}\bar{z}}{\sum_{i=1}^{n}z_i^{2}-n\bar{z}^{2}}.
\end{align*}

This expression for $\hat\beta$ is identical to
\[
\hat\beta=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(z_i-\bar{z})}{\sum_{i=1}^{n}(z_i-\bar{z})^{2}}
\]

(Note: these coincide with the least-squares estimates; see MATH235).
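Since the MLEs coincide with the least-squares estimates, they can be checked against R's lm() function. A brief sketch (not from the notes), with illustrative data of the kind used later in Example 3.9:

# illustrative data: z_i = i, alpha = 10, beta = 0.2, sd = 1
z<-1:25
x<-rnorm(25,mean=10+0.2*z,sd=1)
n<-length(x)

beta.hat<-(sum(x*z)-n*mean(x)*mean(z))/(sum(z^2)-n*mean(z)^2)
alpha.hat<-mean(x)-beta.hat*mean(z)

c(alpha.hat,beta.hat)
coef(lm(x~z))    # least-squares fit; should agree with the MLEs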

Example 3.9:  Linear Regression with unknown variance.

Now consider $\sigma$ unknown. Let $X_i\sim N(\alpha+\beta z_i,\sigma^{2})$ with (known) explanatory variables $(z_1,\dots,z_n)$. Thus, $\theta=(\alpha,\beta,\sigma)$ and $\Omega=(-\infty,\infty)\times(-\infty,\infty)\times(0,\infty)$.

So,

\[
L(\theta)=\prod_{i=1}^{n}\left\{\frac{1}{(2\pi)^{1/2}\sigma}\exp\left\{-\frac{1}{2\sigma^{2}}(x_i-\alpha-\beta z_i)^{2}\right\}\right\}
\]
and
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n}\left\{-\frac{1}{2}\log(2\pi)-\log\sigma-\frac{1}{2\sigma^{2}}(x_i-\alpha-\beta z_i)^{2}\right\}\\
&= -\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\alpha-\beta z_i)^{2}.
\end{align*}
Figure 3.9: Link, Caption: Simulated data (xi,zi) from regression model, together with contour plots of log-likelihood for (α,β),(α,σ) and (β,σ) respectively.

A simulated data set with n=25, zi=i for i=1,,25 and (α=10,β=0.2,σ=1) is shown in Figure 3.9 (Link).

Since this is a three-parameter problem, it is not possible to draw a single contour plot of the log-likelihood. Instead, we look at the three 2-parameter likelihoods obtained by fixing each parameter in turn at its true value and contouring the log-likelihood function of the other two.

Of course, with real data this cannot be done, since the true value of each parameter would be unknown; one possibility is to fix each parameter in turn at its maximum likelihood estimate. This issue will be considered in more detail in Chapter 5.
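As an illustration of this slicing idea, the following R sketch (not from the notes) produces one such panel: σ is fixed at its true value and the log-likelihood is contoured over (α,β). The grid limits are illustrative choices rather than those used for Figure 3.9.

# illustrative data: z_i = i, alpha = 10, beta = 0.2, sigma = 1
z<-1:25
x<-rnorm(25,mean=10+0.2*z,sd=1)

# log-likelihood (additive constants removed)
llikreg<-function(alpha,beta,sigma){
     -length(x)*log(sigma)-sum((x-alpha-beta*z)^2)/(2*sigma^2)
}

# slice: fix sigma=1 and evaluate over a grid of (alpha,beta) values
alpha.grid<-seq(from=8,to=12,length=100)
beta.grid<-seq(from=0,to=0.4,length=100)
llmat<-outer(alpha.grid,beta.grid,
     Vectorize(function(a,b) llikreg(a,b,sigma=1)))

contour(alpha.grid,beta.grid,llmat,levels=max(llmat)-0:4,
     drawlabels=FALSE,xlab="alpha",ylab="beta")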

The following observations can be made:

  • The (α,β) likelihood has near-elliptical contours, with principal axes not parallel to the coordinate axes. Again, the association between the individual parameter estimates is induced because the parameter values partially compensate for each other:

    $E(X_i)=\alpha+\beta z_i$,

    so that increasing α can lead to the same mean if β is decreased accordingly.

    (If you were drawing a regression line ‘by eye’ you could perhaps increase the intercept a little provided you decreased the gradient; the likelihood surface is doing precisely this.)

  • The likelihoods for (α,σ) and (β,σ) are also near-elliptical, but with axes parallel to the coordinate axes.

Example 3.10:  Multinomial Data.

Consider an experiment which has three possible outcomes, with respective probabilities θ1,θ2,θ3 (subject to θ1+θ2+θ3=1). A concrete example is a football match, with outcomes win, lose, draw. Then suppose that n independent trials of the experiment are undertaken and that N=(N1,N2,N3) represents a count of how many times each outcome occurred (so that N1+N2+N3=n). The variable N is said to follow a multinomial distribution.

Simple probability arguments, which generalise those leading to the binomial distribution, lead to the probability mass function

\[
p(n_1,n_2,n_3\mid\theta)=\frac{n!}{n_1!\,n_2!\,n_3!}\,\theta_1^{n_1}\theta_2^{n_2}\theta_3^{n_3},
\]

where n=n1+n2+n3. Moreover, although θ is three-dimensional, the constraint that θ1+θ2+θ3=1 means there are only two independent parameters (the third is fixed once the other two are known).

Thus we may think of the likelihood as a function of just θ1 and θ2, say θ=(θ1,θ2). Hence, the likelihood based on a single realisation (n1,n2,n3) is

\begin{align*}
L(\theta) &= p(n_1,n_2,n_3\mid\theta)\\
&\propto \theta_1^{n_1}\theta_2^{n_2}(1-\theta_1-\theta_2)^{n_3},
\end{align*}
and so
\[
\ell(\theta)=n_1\log\theta_1+n_2\log\theta_2+n_3\log(1-\theta_1-\theta_2),
\]
with $\theta=(\theta_1,\theta_2)$ and $\Omega=\{(\theta_1,\theta_2):0\le\theta_1\le1,\ 0\le\theta_2\le1,\ 0\le\theta_1+\theta_2\le1\}$.

A simulation with θ=(0.5,0.3,0.2) and n=25 led to the data N=(14,5,6). Contours of the corresponding log-likelihood function are plotted in Figure 3.10. The contours in this case are near-elliptical close to the maximum, but have a distorted shape away from the maximum due to the constraints on the parameters. The MLE also seems to give a quite reasonable estimate of θ.
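A short R sketch (not part of the original notes) of how a plot like Figure 3.10 can be produced from the data N=(14,5,6); the grid resolution is an illustrative choice, and points outside Ω are simply marked as missing.

# observed counts from the simulation described above
n1<-14; n2<-5; n3<-6

# log-likelihood; NA outside the parameter space Omega
llikmult<-function(t1,t2){
     if(t1<=0 || t2<=0 || t1+t2>=1) return(NA)
     n1*log(t1)+n2*log(t2)+n3*log(1-t1-t2)
}

theta1<-seq(from=0.01,to=0.99,length=200)
theta2<-seq(from=0.01,to=0.99,length=200)
llmat<-outer(theta1,theta2,Vectorize(llikmult))

contour(theta1,theta2,llmat,levels=max(llmat,na.rm=TRUE)-0:4,
     drawlabels=FALSE,xlab="theta1",ylab="theta2")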

Figure 3.10: Link, Caption: Contour plot of log-likelihood function for (θ1,θ2) for data simulated from multinomial model with θ=(0.5,0.3,0.2).

Example 3.11:  Reparameterised Regression.

The feature observed in the regression example (Example 3.9), of an association between the parameters α and β in the (log)likelihood surface, is rather problematic, because it means that any statements made about estimates of α will always depend on the associated estimate of β.

It is much more desirable to have inferences about two different parameters disentangled, so that what we say about one is independent of what we say about the other.

This requires that the contours of the (log)likelihood surface have principal axes parallel to the coordinate axes, as was the case for each of the other parameter pairs.

In many situations, careful re-parametrisation of a model can overcome this problem. For example, we anticipated that the problem in the regression example was due to a compensation between α and β as the model attempted to match the mean of the data.

This suggests the following reformulation of the problem: $X_i\sim N(\alpha^*+\beta^*(z_i-\bar{z}),\sigma^{2})$, with $\bar{z}=\sum_{i=1}^{n}z_i/n$. Thus, the model is identical to that of the previous example, with
\[
\beta^*=\beta \quad\text{and}\quad \alpha^*=\alpha+\beta\bar{z},
\]
$\theta=(\alpha^*,\beta^*,\sigma)$ and $\Omega=(-\infty,\infty)\times(-\infty,\infty)\times(0,\infty)$. Now,
\[
E(X_i)=\alpha+\beta z_i=\alpha^*+\beta^*(z_i-\bar{z}),
\]
so that $\alpha^*$ represents the value of $E(X_i)$ when $z_i=\bar{z}$.

Now, to see why this improves things, think again about drawing a regression line ‘by eye’, but this time, instead of choosing the intercept and then the gradient, you choose a value for α* (i.e. the centre of the cloud of points), and then the gradient. What you should find is that you can (more or less) make a decision about the gradient which is unaffected by your decision about the value of α*. Likelihood-based techniques benefit from the re-parametrisation for the same reasons.

Working with the new parameters,

\[
\ell(\theta)=-\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left[x_i-\alpha^*-\beta^*(z_i-\bar{z})\right]^{2}.
\]
To calculate the MLEs for the new parameters, we can proceed as before. Thus
\begin{align*}
\frac{\partial \ell}{\partial \alpha^*}(\theta) &= \sigma^{-2}\sum_{i=1}^{n}\left(x_i-\alpha^*-\beta^*(z_i-\bar{z})\right)\\
\frac{\partial \ell}{\partial \beta^*}(\theta) &= \sigma^{-2}\sum_{i=1}^{n}(z_i-\bar{z})\left(x_i-\alpha^*-\beta^*(z_i-\bar{z})\right).
\end{align*}

Now
\[
\frac{\partial \ell}{\partial \alpha^*}(\theta)=\sigma^{-2}\sum_{i=1}^{n}(x_i-\alpha^*),
\]
as $\sum_{i=1}^{n}(z_i-\bar{z})=0$. Thus setting this partial derivative to 0 gives $\hat\alpha^*=\bar{x}$. Setting the other partial derivative to 0, and substituting for $\hat\alpha^*$, gives

\[
0=\sigma^{-2}\sum_{i=1}^{n}(z_i-\bar{z})\left(x_i-\bar{x}-\hat\beta^*(z_i-\bar{z})\right),
\]
so
\[
\hat\beta^*=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(z_i-\bar{z})}{\sum_{i=1}^{n}(z_i-\bar{z})^{2}},
\]

as before.

Note that these MLEs could also be obtained from the MLEs of the earlier example using the multiparameter extension of the invariance principle: $\beta^*=\beta$, so $\hat\beta^*=\hat\beta$; and $\alpha^*=\alpha+\beta\bar{z}$, so $\hat\alpha^*=\hat\alpha+\hat\beta\bar{z}$. But as $\hat\alpha=\bar{x}-\hat\beta\bar{z}$, then $\hat\alpha^*=\bar{x}$.
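These results can also be verified numerically: fitting the model with the centred covariate in R, the estimated intercept equals $\bar{x}$. A brief sketch (not from the notes), with illustrative data:

# illustrative data: z_i = i, alpha = 10, beta = 0.2, sd = 1
z<-1:25
x<-rnorm(25,mean=10+0.2*z,sd=1)
zc<-z-mean(z)       # centred covariate

coef(lm(x~zc))      # intercept = alpha-hat-star, slope = beta-hat-star
mean(x)             # equals the fitted intercept, as derived above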

The corresponding contour plot of the log-likelihood is shown in Figure 3.11. Notice now that the contours have principal axes seemingly parallel to the coordinate axes.

Figure 3.11: Link, Caption: Contour plot of log-likelihood for (α*,β*) in re-parametrised regression example.

Comment: the information in the (α*,β*) likelihood is exactly that of the (α,β) likelihood, but viewed from a different angle. The important thing, however, is that from the new angle, statements about each of the two parameters can be made regardless of the value of the other parameter.

Example 3.12:  Logistic Regression.
In a binomial regression experiment we have $X_i\sim\text{Binomial}(1,p_i)$, where $p_i$ is some function of a covariate $z_i$ associated with individual $i$. To make things more concrete (if overly simplistic), suppose $X_i$ is 1 or 0 according to whether individual $i$ passes an exam or not, and $z_i$ is the amount of time individual $i$ studies for the exam.

We might anticipate that the greater the value of zi, the greater the value of pi, the probability of passing. But using a relationship like

\[
p_i=\alpha+\beta z_i
\]

is problematic, since the values of pi have to lie between 0 and 1. Thus, it is usual to suppose that pi is related to zi non-linearly in such a way that the constraints on pi are respected.

One relationship often used is the logistic relationship:

\[
p_i=\frac{\exp\{\alpha+\beta z_i\}}{1+\exp\{\alpha+\beta z_i\}}.
\]
Thus, if $\beta$ is positive, $p_i$ increases with $z_i$; if $\beta=0$, then $p_i$ is constant (equal to $e^{\alpha}/(1+e^{\alpha})$) for all individuals; if $\beta$ is negative, then $p_i$ decreases with $z_i$. In each case, though, $p_i$ lies between 0 and 1.

The likelihood for individual $i$ is
\[
p_i^{x_i}(1-p_i)^{1-x_i}=\begin{cases}p_i & \text{if } x_i=1\\[2pt] 1-p_i & \text{if } x_i=0.\end{cases}
\]
Therefore, the overall likelihood for $\theta$ is
\[
L(\theta)=\prod_{i=1}^{n}p_i^{x_i}(1-p_i)^{1-x_i},
\]
so
\begin{align*}
\ell(\theta) &= \sum_{i=1}^{n}\left\{x_i\log p_i+(1-x_i)\log(1-p_i)\right\}\\
&= \sum_{1}\log p_i+\sum_{0}\log(1-p_i),
\end{align*}
where $\sum_{1}$ denotes a sum over the individuals for whom $x_i=1$ and $\sum_{0}$ a sum over the individuals for whom $x_i=0$.
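A minimal R sketch (not part of the original notes) of this log-likelihood, which could be evaluated over a grid and contoured exactly as in the earlier examples. The simulated value of α below is an illustrative assumption (the notes specify only n=25 and β=0.2); glm() with a binomial family gives the maximum likelihood fit for comparison.

# illustrative data: the choice alpha = -2 is an assumption
z<-1:25
p<-exp(-2+0.2*z)/(1+exp(-2+0.2*z))
x<-rbinom(25,size=1,prob=p)

# log-likelihood for (alpha,beta) under the logistic relationship
lliklogistic<-function(alpha,beta){
     p<-exp(alpha+beta*z)/(1+exp(alpha+beta*z))
     sum(x*log(p)+(1-x)*log(1-p))
}

lliklogistic(-2,0.2)                  # evaluate at one (alpha,beta) pair
coef(glm(x~z,family=binomial))        # maximum likelihood estimates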

A simulated set of data with n=25 and β=0.2 is shown in Figure 3.12 (Link) — notice that since β>0, the chance of xi being 1 rather than 0 increases with zi.

Figure 3.12: Link, Caption: Simulated data from binary regression example.
Figure 3.13: First Link, Second Link, Caption: Contour plot of log-likelihood surface for (α,β) in binary regression example. Left: (α,β); Right: (α*,β*), after re-parameterisation

A contour plot of the log-likelihood for these data is shown in Figure 3.13 (left).

  • As usual, the maximum likelihood estimate provides a reasonable estimate of θ.

  • The contours are near-ellipses, but with substantial association between α and β.

As in the linear regression example, this can be partly overcome by the re-parametrisation

\[
\beta^*=\beta \quad\text{and}\quad \alpha^*=\alpha+\beta\bar{z}.
\]

This leads to the likelihood surface shown in Figure 3.13 (right). There is indeed some improvement but, unlike the linear regression example, some association between the two parameters remains.