Suppose we have some data $x = (x_1, \dots, x_n)$, a realisation of some random variables $X = (X_1, \dots, X_n)$ that we assume have some (joint) distribution or model $f(x; \theta)$ for the data. This is a fully general description: so far we are not assuming that the data are independent or identically distributed, and $\theta$ may be a vector of parameters.
The fully general definition of the likelihood function is any function $L(\theta)$ such that
\[
L(\theta) \propto f(x; \theta),
\]
viewed as a function of $\theta$. Importantly, this does not define a distribution for $\theta$, as $\theta$ is on the wrong side of the conditioning. It defines a distribution for the random variable $X$ for each fixed value of $\theta$.
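As a concrete illustration (an example added here, not part of the definition above), take a single Bernoulli observation with success probability $\theta \in [0, 1]$:
\[
f(x; \theta) = \theta^{x} (1 - \theta)^{1 - x}, \quad x \in \{0, 1\},
\qquad \text{so observing } x = 1 \text{ gives } L(\theta) \propto \theta.
\]
For each fixed $\theta$ this $f$ sums to one over $x \in \{0, 1\}$, but $L(\theta) = \theta$ does not integrate to one over $[0, 1]$: the likelihood is not a distribution for $\theta$.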
For much of this course, we will assume that $x = (x_1, \dots, x_n)$ consists of independent and identically distributed (IID) realisations, i.e. with each $x_i$ a realisation of the same random variable:
\[
X_1, \dots, X_n \overset{\text{IID}}{\sim} f(\,\cdot\,; \theta).
\]
In this special case, we can write the likelihood function as being proportional to the product of the densities of the observations:
\[
L(\theta) \propto \prod_{i=1}^{n} f(x_i; \theta).
\]
Note that here $f$ is being used to denote both a joint density and a marginal density.
Recall that the proportionality in the definition allows us to discard any multiplicative constants that do not involve $\theta$. The set of possible values of $\theta$ is $\Theta$, with $\Theta$ called the parameter space. If $\theta \notin \Theta$ then $L(\theta) = 0$.
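As a worked example (added for illustration, with the Poisson model an assumed choice), suppose $x_1, \dots, x_n$ are IID realisations from a $\text{Poisson}(\theta)$ distribution:
\[
L(\theta) = \prod_{i=1}^{n} \frac{e^{-\theta} \theta^{x_i}}{x_i!}
\propto e^{-n\theta} \, \theta^{\sum_{i=1}^{n} x_i},
\]
where the multiplicative constant $\prod_{i=1}^{n} 1/x_i!$ has been discarded because it does not involve $\theta$; here the parameter space is $\Theta = (0, \infty)$.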
It is often more useful to work with the log-likelihood function, defined by:
\[
\ell(\theta) = \log L(\theta),
\]
with multiplicative proportionality constants of $L(\theta)$ translated into an additive constant.
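Continuing the Poisson illustration above, taking logs turns the product into a sum:
\[
\ell(\theta) = \log L(\theta) = -n\theta + \Big(\sum_{i=1}^{n} x_i\Big) \log \theta + c,
\]
where the additive constant $c = -\sum_{i=1}^{n} \log x_i!$ comes from the multiplicative constant discarded earlier, and can itself be dropped.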
Note 1: Really both the likelihood and log-likelihood are functions of both $\theta$ and $x$, i.e. $L(\theta; x)$ and $\ell(\theta; x)$, but usually we drop $x$ as the data do not change.
Note 2: Sometimes we are interested in how $L$ and $\ell$ change over different realisations of the random variables $X_1, \dots, X_n$. Then we use $L(\theta; X)$ and $\ell(\theta; X)$ to show this dependence, with these being random functions of $\theta$ as they vary with $X$.
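A minimal numerical sketch of this point (assuming NumPy and SciPy; the Poisson model, sample size, and all names here are illustrative choices, not from the notes): each simulated realisation of $X$ produces a different log-likelihood curve, so $\ell(\theta; X)$ is a random function of $\theta$.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)


def log_lik(theta, x):
    # ell(theta; x): sum of log marginal densities, valid under the IID assumption
    return np.sum(poisson.logpmf(x, theta))


thetas = np.linspace(0.5, 6.0, 50)  # grid over the parameter space Theta

# Each draw of X = (X_1, ..., X_n) gives a different function of theta,
# illustrating that ell(theta; X) is a random function of theta.
for rep in range(3):
    x = rng.poisson(lam=3.0, size=20)  # one realisation of X
    curve = np.array([log_lik(t, x) for t in thetas])
    print(f"realisation {rep}: grid argmax of ell at theta ~ "
          f"{thetas[np.argmax(curve)]:.2f}, sample mean = {x.mean():.2f}")
```

For the Poisson model the maximiser of each curve is the sample mean, so the printed grid argmax should track $\bar{x}$ across realisations.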