
6.4 Multiple linear regression

Multiple linear regression follows exactly the same concept as simple linear regression, except that the expected value of the response variable may depend on more than one explanatory variable.

Example 6.4.1 Birthweight cont.

As well as information on birthweight and gestational age, we know the gender of each child. This is illustrated in Figure 6.4. A separate linear relationship between birthweight and gestational age has been fitted for males and females.

  1. (a)

    Do males and females gain weight at different rates?

  2. (b)

    Do we need both gestational age and gender to explain variability in birth weights, or is one of these sufficient?

Fig. 6.4: Birthweight (grams) against gestational age (weeks), split by gender: separate straight line relationships added (bottom).
Example 6.4.2 Gas consumption cont.

A little under half way through the experiment, cavity wall insulation was installed in the house. This is illustrated in Figure 6.5. By including this information in the model, we can assess whether or not insulation alters gas consumption. In this case we might suspect that both separate intercepts and separate gradients are required for observations before insulation and after insulation.

Fig. 6.5: Gas consumption (1000’s cubic feet) against outside temperature (C), before (blue) and after (red) cavity wall insulation: separate straight line relationships added.
Remark.

How can the point at which the two lines intersect be interpreted?

Example 6.4.3 Brain weight cont.

As well as brain and body weights, we also have available, for each species, total sleep (hours per day) and the period of gestation (time spent in the womb before birth, in days). Figure 6.6 suggests linear relationships between log brain weight and each of the variables total sleep and log gestational period.

  1. (a)

    How can we compare the three models for log brain weight shown in Figure 6.3 and Figure 6.6?

  2. (b)

    Are all three relationships significant?

  3. (c)

    Which of the three explanatory variables is the most important in explaining variability amongst brain weights?

  4. (d)

    How many of these explanatory variables should we use?

Fig. 6.6: log Brain weight (g) against total daily sleep (hrs), left, and log gestational period, right, for 58 species of mammals (four species removed due to no sleep data).
Example 6.4.4 Cereal prices

Regression models can be useful in economics and finance. We investigate global commodity price forecasts for various cereals from 1995–2015.

Fig. 6.7: Forecasts of annual prices for maize against barley (left) and wheat (right) for 1995–2015. Prices are in dollars per tonne.

Annual prices (dollars per tonne) for maize, barley and wheat were made available by the Economist Intelligence Unit and downloaded from http://datamarket.com/. Figure 6.7 shows maize prices against both barley and wheat prices.

  1. (a)

    Which of the explanatory variables best explains changes in maize prices?

  2. (b)

    What happens if we include both barley and wheat in the model for maize prices?

  3. (c)

    How well do the three possible regression models (barley only, wheat only and both barley and wheat) fit the data?

  • The multiple linear regression model has a very similar definition to the simple linear regression model, except that

    1. Each individual has a single response variable Yi, but a vector of explanatory variables (xi,1,…,xi,p). The number of explanatory variables is denoted by p.

  • There are p regression coefficients β1,β2,,βp, where βj describes the effect of the j-th explanatory variable on the expected value of the response.

Definition (Multiple linear regression model).

For i=1,,n,

Yi=β1xi,1+β2xi,2++βpxi,p+ϵi.

The residuals ϵ1,,ϵn satisfy exactly the same assumptions as for the simple linear regression model. That is the residuals are independent and identically distributed, with a normal distribution, i.e. for i=1,,n,

ϵiN(0,σ2).

An informal definition of the multiple linear model is:

  • YiN(j=1pβjxi,j,σ2), i=1,,n;

  • Y1,,Yn are independent.

Remark.

To include an intercept term in the model, set xi,1=1 for all individuals i=1,…,n. This gives

𝔼[Yi]=β1×1+j=2pβjxi,j=β1+j=2pβjxi,j.

Then β1 is the intercept.

Later in the course it will be useful to write our models using the following matrix notation.

  • The response vector is Y=(Y1,…,Yn), an n×1 column vector.

  • The design matrix X is an n×p matrix whose columns correspond to explanatory variables and whose rows correspond to subjects. That is, Xi,j denotes the value of the j-th explanatory variable for individual i. If an intercept term is included, then the first column of X is a column of 1's.

  • The residual vector is ϵ=(ϵ1,…,ϵn), an n×1 column vector.

  • The p×1 vector of coefficients is β=(β1,…,βp).

        Then we can write the multiple linear regression model as follows,

        • 1

          Y=Xβ+ϵ where ϵMVNn(0,σ2I) and I is the n×n identity matrix.

        • 2

          Informally, YMVNn(Xβ,σ2I).
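As a concrete illustration of this matrix form, the short R sketch below simulates one data set from the model; all numbers (n, β, σ) are hypothetical and chosen purely for illustration.

set.seed(42)
n <- 10                                     # number of individuals
X <- cbind(1, rnorm(n))                     # design matrix: intercept column plus one covariate
beta <- c(2, 0.5)                           # hypothetical regression coefficients
sigma <- 1                                  # hypothetical residual standard deviation
epsilon <- rnorm(n, mean = 0, sd = sigma)   # residuals, N(0, sigma^2)
Y <- X %*% beta + epsilon                   # response vector: Y = X beta + epsilon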

        For the remainder of the course we will not distinguish between simple and multiple linear regression, since the former is a special case of the latter. Instead we refer to linear regression.

        Remark.

        Because of the normality assumption on the residuals the full title of the model is the normal linear regression model.

        Response variables

        From the informal definition of the linear regression model, the response variable Yi should be continuous and follow a normal distribution. It only makes sense to apply a linear regression model to data for which the response variable satisfies these criteria.

The assumption of normality can be verified with a normal QQ plot. If the response is continuous but non-normal, a transformation may be used to bring it closer to normality. For example, we might take the log or square root of the responses.

        Math333 will introduce the concept of the generalised linear model, in which the normality assumption is relaxed to cover a wide family of distributions, including the Poisson and binomial.

        Factors

        Explanatory variables may be continuous or discrete, qualitative or quantitative.

        Definition.
        • 1

          A covariate is a quantitative explanatory variable.

        • 2

A factor is a qualitative explanatory variable. The possible values for the factor are called levels. For example, gender is a factor with two levels: male and female.

        Factors are represented by indicator variables in a linear regression model. For a p-level factor, p indicator variables are created. For individual i, the indicator variable for level j takes the value 1 if that individual has level j of the factor; otherwise it takes the value zero.

        An example of a two-level factor is gender. To include gender as an explanatory variable we create two indicator variables xi,1, to show whether individual i is male, and xi,2, to show whether individual i is female. Then

        xi,1={1if individualiis male0if individualiis female

        and

        xi,2 ={1if individualiis female0if individualiis male
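In R, such indicator variables can be constructed directly from the stored gender values. A minimal sketch, assuming the bwt data frame used later in Example 7.1.2 and that its Gender column is coded "M"/"F" as in Table 7.1:

> x1 <- as.numeric(bwt$Gender == "M")    # 1 if individual i is male, 0 otherwise
> x2 <- as.numeric(bwt$Gender == "F")    # 1 if individual i is female, 0 otherwise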

        6.4.1 Examples

Example 6.4.5 Birth weights cont.

        The response variable Yi is birth weight. There are two explanatory variables, gender (factor) and gestational age (continuous). Let xi,1 and xi,2 be indicator variables for males and females respectively and xi,3 be gestational age.

        One possible model is

        𝔼[Yi]=β1xi,1+β2xi,2+β3xi,3 (6.2)

        This assumes a different intercept for males (β1) and females (β2), but a common slope for gestational age (β3). It does not include an overall intercept term - we will see later why this is.

        A second possible model has a common intercept, but allows for separate slopes for males and females; this is an interaction between gender and age.

        𝔼[Yi]=β1+β2xi,1xi,3+β3xi,2xi,3 (6.3)

        What are the interpretations of β1, β2 and β3?

        • 1

          β1 is the expected birth weight of a baby born at 0 weeks gestation, regardless of gender;

        • 2

          β2 is the expected change in birth weight for a male with every extra week of gestation;

        • 3

          β3 is the expected change in birth weight for a female with every extra week of gestation.

        The design matrix for model 6.2 has three columns; the first is the indicator for males, the second the indicator column for females and the third contains gestational age.

        Describe the design matrix for model 6.3.

        The design matrix for model 6.3 has three columns. The first is a column of 1’s for the intercept. The second is the product of the indicator variable for males and gestational age. The third is the product of the indicator variable for females and gestational age.

        A third possible model, combining the first two, includes separate intercepts and separate slopes for the two genders.

        𝔼[Yi]=β1xi,1+β2xi,2+β3xi,1xi,3+β4xi,2xi,3 (6.4)

Plots of the three model fits are shown in Figures 6.8, 6.9 and 6.10.

        Fig. 6.8: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.2.
        Fig. 6.9: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.3.
        Fig. 6.10: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.4.
        Remark.

        How can we choose which of models 6.2, 6.3 and 6.4 fits the data best? Intuitively model 6.3 seems sensible - all babies start at the same weight, but gender may affect the rate of growth. However, since our data only covers births from 35 weeks gestation onwards, we should only think about the model which best reflects growth during this period.

        We will look at issues of model selection later.

Example 6.4.6 Gas consumption

Continuing Example 6.4.2, the response Yi is gas consumption. There are two explanatory variables: outside temperature (continuous) and before/after cavity wall insulation (factor).

        Let xi,1 be outside temperature and xi,2 be an indicator variable for after cavity wall insulation, i.e.

        xi,2 ={1if observationiis after insulation0if observationiis before insulation

        The modelling approach is as follows. Example 6.2.2 gave a regression on outside temperature only,

        𝔼[Yi]=β1+β2xi,1. (6.5)

Figure 6.5 suggests that the rate of change of gas consumption with outside temperature was altered following insulation. There is also evidence of a difference in intercepts before and after insulation. We could include this information in the model as follows,

        𝔼[Yi]=β1+β2xi,1+β3xi,2+β4xi,1xi,2. (6.6)

        What are the interpretations of β1 and β3 in this model?

        • 1

          β1 is the expected gas consumption when the outside temperature is 0C, before insulation;

        • 2

          β2 is the change in gas consumption for a 1C change in outside temperature, before insulation.

        To interpret β3 and β4:

        • 1

β1+β3 is the expected gas consumption when the outside temperature is 0C, after insulation;

        • 2

          β2+β4 is the change in gas consumption for a 1C change in outside temperature, after insulation.

        β3 tells us about the change in intercept following insulation; β4 tells us how the relationship between gas consumption and outside temperature is altered following insulation.

        Examples 6.4.5 and 6.4.6 show two different ways of including factors in linear models. In Example 6.4.5, indicator variables for all factors are included, but there is no intercept. In Example 6.4.6, there is an intercept term, but the indicator variable for only one of the two levels of the factor is included.

        In general, we include an intercept term and indicator variables for p-1 levels of a p-level factor. This ensures that the columns of the design matrix X are linearly independent - even if we include two or more factors in the model.

        For interpretation, one level of the factor is set as a ‘baseline’ (in our example this was before insulation) and the regression coefficients for the remaining levels of the factor can be used to report the additional effect of the remaining levels on top of the baseline.
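R's default treatment coding for factors follows exactly this convention: an intercept column plus indicator columns for all but one (baseline) level. A small sketch with a purely hypothetical three-level factor shows the design matrix columns that R would create:

> f <- factor(c("a", "b", "c", "a", "b", "c"))   # hypothetical three-level factor
> model.matrix(~ f)                              # intercept plus indicators for levels "b" and "c"; "a" is the baseline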

        6.5 Summary

        • 1

          Linear regression models provide a tool to estimate the linear relationship between one or more explanatory variables and a response variable.

        • 2

          Such models may be used for explanatory or predictive purposes. Care must be taken when extrapolating beyond the range of the observed data.

        • 3

          A simple linear regression model, which includes an intercept term and a single explanatory variable, is a special case of the multiple regression model

        • 4

          A number of assumptions need to be made when the model is defined:

Yi=β1xi,1+β2xi,2+…+βpxi,p+ϵi,

          i=1,,n, where the residuals ϵ1,,ϵn are independent and identically distributed with

          ϵiNormal(0,σ2),

          i=1,,n.

        • 5

In practical terms, the response variable Yi should be continuous with a distribution close to Normal. A transformation may be used to achieve approximate normality.

        • 6

          Explanatory variables may be

          • 1

            Quantitative (covariates);

          • 2

            Qualitative (factors).

        • 7

Factors are represented in the model using an indicator variable for each level of the factor.

        Chapter 7 Linear regression - fitting

        The linear regression model has three unknowns:

        • 1

          The regression coefficients, β=(β1,,βp);

        • 2

          The residual variance, σ2;

        • 3

          The residuals ϵ=(ϵ1,,ϵn).

        In this chapter we will look at how each of these components can be estimated from a sample of data.

        7.1 Estimation of regression coefficients β

        We shall use the method of least squares to estimate the vector of regression coefficients β. For a linear regression model, this approach to parameter estimation gives the same parameter estimates as the method of maximum likelihood, which will be discussed in weeks 7–10 of this course.

The basic idea of least squares estimation is to find the estimate β^ which minimises the sum of squares function

        S(β)=i=1n(yi-β1xi,1--βpxi,p)2. (7.1)

        We can rewrite the linear regression model in terms of the residuals as

        ϵi=Yi-β1xi,1--βpxi,p

        By replacing Yi with yi, S(β) can be interpreted as the sum of squares of the observed residuals. In general, the sum of squares function S(β) is a function of p unknown parameters, β1,,βp. To find the parameter values which minimise the function, we calculate all p first-order derivatives, set these derivatives equal to zero and solve simultaneously.

        Using definition (7.1), the j-th first-order derivative is

        δSδβj=-2i=1nxi,j(yi-β1xi,1--βpxi,p). (7.2)

        We could solve the resulting system of p equations by hand, using e.g. substitution. Since this is time consuming we instead rewrite our equations using matrix notation. The j-th first-order derivative corresponds to the j-th element of the vector

        -2X(y-Xβ).

        Thus to find β^ we must solve the equation,

        -2X(y-Xβ^)=0.

        Multiplying out the brackets gives

        -2Xy+2XXβ^=0

        which can be rearranged to

        XXβ^=Xy.

        Multiplying both sides by (XX)-1 gives the least squares estimates for β^,

        β^=(XX)-1Xy.

        This is one of the most important results of the course!
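A quick numerical check of this result can be carried out in R by minimising S(β) directly and comparing with the closed-form estimate. The sketch below uses simulated data (all values hypothetical), not one of the course data sets.

set.seed(1)
n <- 50
x <- runif(n)                                # hypothetical explanatory variable
y <- 2 + 3 * x + rnorm(n, sd = 0.5)          # hypothetical response
X <- cbind(1, x)                             # design matrix with an intercept column
S <- function(beta) sum((y - X %*% beta)^2)  # sum of squares function (7.1)
optim(c(0, 0), S, method = "BFGS")$par       # numerical minimiser of S
drop(solve(t(X) %*% X) %*% t(X) %*% y)       # closed-form least squares estimate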

        Remark 1.

        In order for the least squares estimate (7.3) to exist, (XX)-1 must exist. In other words, the p×p matrix XX must be non-singular,

        • 1

          XX is non-singular iff it has linearly independent columns;

        • 2

          This occurs iff X has linearly independent columns;

        • 3

          Consequently, explanatory variables must be linearly independent;

        • 4

          This relates back to the discussion on factors in Section 6.4. Linear dependence occurs if

          • 1

An intercept term and the indicator variables for all levels of a factor are included in the model, since the columns representing the indicator variables sum to the column of 1's.

          • 2

The indicator variables for all levels of two or more factors are included in a model, since the columns representing the indicator variables sum to the column of 1's for each factor.

          Consequently it is safest to include an intercept term and indicator variables for p-1 levels of each p-level factor.

        Remark 2.

        If you want to bypass completely the summation notation used above, the sum of squares function (7.1) can be written as

        S(β)=(y-Xβ)(y-Xβ)=yy-βXy-yXβ+βXXβ. (7.3)
        • 1

          Now βXy=(yXβ) and since both βXy and yXβ are scalars (can you verify this?) we have that βXy=yXβ.

        • 2

          Hence,

          S(β)=yy-2βXy+βXXβ.
        • 3

          Differentiating with respect to β gives the vector of first-order derivatives

          -2Xy+2XXβ=-2X(y-Xβ)

          as before.

        Remark 3.

        To prove that β^ minimises the sum of squares function we must check that the matrix of second derivatives is positive definite at β^.

        • 1

          This is the multi-dimensional analogue to checking that the second derivative is positive at the minimum of a function in one unknown.

        • 2

          Returning once more to summation notation,

          δ2Sδβkβj=2i=1nxi,jxi,k.
        • 3

          This is the (j,k)-th element of the matrix XX. Thus the second derivative of S(β) is XX.

        • 4

          To prove that XX is positive definite, we must show that zXXz>0 for all non-zero vectors z.

        • 5

Since zXXz can be written as (Xz)Xz, the product of the vector Xz with its transpose, it equals the squared length of Xz. This is always non-negative, and it is strictly positive for every non-zero z provided the columns of X are linearly independent, so the result follows.

        7.1.1 Examples

Example 7.1.1 Birth weights cont.

        We return to the birth weight data in example 6.2.1. The full data set is given in Table 7.1. We will fit the simple linear regression for birth weight Yi with gestational age xi as explanatory variable,

        𝔼[Yi]=β1+β2xi

        The response vector and design matrix are

y = ( 2968, 2795, 3163, 2925, …, 2875, 3231 ), a column vector of the 24 birth weights,

and

X = [ 1  40
      1  38
      1  40
      1  35
      ⋮   ⋮
      1  39
      1  40 ].

        Obtain the least squares estimate β^.

        To find β^ we use the formula

        β^=(XX)-1Xy

        From above,

1. XX = [  24     925
          925   35727 ],

2. (XX)-1 = [ 19.6    -0.507
             -0.507    0.0132 ],

3. Xy = ( 71224, 2753867 ), a column vector.

Therefore,

β^ = (XX)-1Xy = ( -1485, 116 ).

        The fitted model for birth weight, given gestational age at birth is,

        𝔼[Yi]=-1485+116xi

        We can interpret this as follows,

        • 1

          For every additional week of gestation, expected birth weight increases by 116 grams.

        • 2

          If a child was born at zero weeks of gestation, their birth weight would be -1485 grams.

        Why does the second result not make sense?

        Because the matrices involved can be quite large, whether due to a large sample size n, a large number p of explanatory variables, or both, it is useful to be able to calculate parameter estimates using computer software. In R, we can do this ‘by hand’ (treating R as a calculator), or we can make use of the function lm which will carry out the entire model fit. We illustrate both ways.

Example 7.1.2 Birth weight model in R

        Load the data set bwt into R. To obtain the size of the data set,

        > dim(bwt)
        [1] 24  3

        This tells us that there are 24 subjects and 3 variables. The variables are,

        > names(bwt)
        [1] "Age"    "Weight" "Gender"

        To fit the simple linear regression of the previous example ‘by hand’,

        1. 1

          Set up the design matrix,

          > X <- matrix(cbind(rep(1,24),bwt$Age),ncol=2)
        2. 2

          Calculate β^ using equation (7.3),

          > beta <- solve(t(X)%*%X)%*%t(X)%*%bwt$Weight
        3. 3

          View results

          > beta
          [,1]
          [1,] -1484.9846
          [2,]   115.5283

        To fit the same model using lm,

        1. 1

          Specify the required model. Note R assumes that you want to include an intercept term, so this need not be explicitly included,

          > bwtlm <- lm(bwt$Weight~bwt$Age)
        2. 2

          To view the estimates β^,

          > bwtlm$coefficient
          (Intercept)     bwt$Age
          -1484.9846    115.5283
        Child Gestational Age (wks) Birth weight (grams) Gender
        1 40 2968 M
        2 38 2795 M
        3 40 3163 M
        4 35 2925 M
        5 36 2625 M
        6 37 2847 M
        7 41 3292 M
        8 40 3473 M
        9 37 2628 M
        10 38 3176 M
        11 40 3421 M
        12 38 2975 M
        13 40 3317 F
        14 36 2729 F
        15 40 2935 F
        16 38 2754 F
        17 42 3210 F
        18 39 2817 F
        19 40 3126 F
        20 37 2539 F
        21 36 2412 F
        22 38 2991 F
        23 39 2875 F
        24 40 3231 F
        Table 7.1: Gestational age at birth (weeks), birth weight (grams) and gender of 24 individuals.
Example 7.1.3 Gas consumption cont.

        Recall example 6.4.2 in which we investigated the relationship between gas consumption and external temperature. To measure the effect of changes in the external temperature on gas consumption, we fit the multiple linear regression model 6.6. We will allow a different relationship between gas consumption and outside temperature before and after the installation of cavity wall insulation. The model has four regression coefficients

        𝔼[Yi]=β1+β2xi,1+β3xi,2+β4xi,1xi,2

        Here xi,1 is outside temperature and xi,2 is an indicator variable taking the value 1 after installation.

        The data are shown in Table 7.2.

        To estimate the parameters by hand, we first set up the response vector and design matrix,

y = ( 7.2, 6.9, 6.4, …, 2.6, 4.8, …, 3.5, 3.4 )

and

X = [ 1  -0.8   0    0
      1  -0.7   0    0
      1   0.4   0    0
      ⋮    ⋮    ⋮    ⋮
      1  10.2   0    0
      1  -0.7   1  -0.7
      ⋮    ⋮    ⋮    ⋮
      1   4.7   1   4.7
      1   4.9   1   4.9 ].

        Since XX will be a 4×4 matrix, it is easier to do our calculations in R. First load the data set gas.

        > names(gas)
        [1] "Insulate"  "Temp"      "Gas"       "Insulate2"
        • 1

          Insulate contains Before or After to indicate whether or not cavity wall insulation has taken place;

        • 2

          Temp contains outside temperature;

        • 3

          Gas contains gas consumption;

        • 4

          Insulate2 contains a 0 or 1 to indicate before (0) or after (1) cavity wall insulation.

        To set up the design matrix

        > X <- matrix(cbind(rep(1,44),gas$Temp,gas$Insulate2,
        gas$Insulate2*gas$Temp),ncol=4)

        Then to obtain β^,

        > beta <- solve(t(X)%*%X)%*%(t(X)%*%gas$Gas)
        > beta
        [,1]
        [1,]  6.8538277
        [2,] -0.3932388
        [3,] -2.2632102
        [4,]  0.143612

        Thus the fitted model is

        𝔼[Yi]=6.85-0.393xi,1-2.26xi,2+0.144xi,1xi,2
        • 1

          Before cavity wall insulation, when the outside temperature is 0C, the expected gas consumption is 6.85 1000’s cubic feet.

        • 2

          Before cavity wall insulation, for every increase in temperature of 1C, the expected gas consumption decreases by 0.393 1000’s cubic feet.

        • 3

          After cavity wall insulation, for every increase in temperature of 1C, the expected gas consumption decreases by 0.249 1000’s cubic feet.

        Where does the figure 0.249 come from?

        Substitute xi,2=1 into the fitted model; -0.393+0.144 is the overall rate of change of gas consumption with temperature.

        What is the expected gas consumption after cavity wall insulation, when the outside temperature is 0C?

        6.85-2.26=4.59 thousand cubic feet.

        We can alternatively fit this model in R using lm,

        > gaslm <- lm(gas$Gas~gas$Temp*gas$Insulate2)
        > coefficients(gaslm)
        (Intercept)               gas$Temp
        6.8538277             -0.3932388
        gas$Insulate2             gas$Temp:gas$Insulate2
        -2.2632102              0.143612
        Remark.

We have used * to specify an interaction between two explanatory variables. R then includes an intercept, a term for each of the explanatory variables and the interaction between them. We will look at interactions in more detail later.

        Remark.

        The model suggests that cavity wall insulation decreases gas consumption when the outside temperature is 0C. Further, the rate of increase in gas consumption as the outside temperature decreases is less when the cavity wall is insulated. Are these differences significant?

        Observation Insulation Outside Temp. (C) Gas consumption
        1 Before -0.8 7.2
        2 Before -0.7 6.9
        3 Before 0.4 6.4
        4 Before 2.5 6.0
        5 Before 2.9 5.8
        6 Before 3.2 5.8
        7 Before 3.6 5.6
        8 Before 3.9 4.7
        9 Before 4.2 5.8
        10 Before 4.3 5.2
        11 Before 5.4 4.9
        12 Before 6.0 4.9
        13 Before 6.0 4.3
        14 Before 6.0 4.4
        15 Before 6.2 4.5
        16 Before 6.3 4.6
        17 Before 6.9 3.7
        18 Before 7.0 3.9
        19 Before 7.4 4.2
        20 Before 7.5 4.0
        21 Before 7.5 3.9
        22 Before 7.6 3.5
        23 Before 8.0 4.0
        24 Before 8.5 3.6
        25 Before 9.1 3.1
        26 Before 10.2 2.6
        27 After -0.7 4.8
        28 After 0.8 4.6
        29 After 1.0 4.7
        30 After 1.4 4.0
        31 After 1.5 4.2
        32 After 1.6 4.2
        33 After 2.3 4.1
        34 After 2.5 4.0
        35 After 2.5 3.5
        36 After 3.1 3.2
        37 After 3.9 3.9
        38 After 4.0 3.5
        39 After 4.0 3.7
        40 After 4.2 3.5
        41 After 4.3 3.5
        42 After 4.6 3.7
        43 After 4.7 3.5
        44 After 4.9 3.4
        Table 7.2: Outside temperature (C), gas consumption (1000’s cubic feet) and whether or not cavity wall insulation has been installed.

        7.2 Predicted values

        Once we have estimated the regression coefficients β^, we can estimate predicted values of the response variable. The predicted value for individual i is defined as

        μ^i=β^1xi,1+β^2xi,2++β^pxi,p. (7.4)

        This equation can also be used to obtain predicted values for combinations of explanatory variables unobserved in the sample (see example 7.2.1). However, care should be taken not to extrapolate too far outside of the observed ranges of the explanatory variables.

        The predicted value is interpreted as the expected value of the response variable for a given set of explanatory variable values. Predicted values are useful for checking model fit, calculating residuals and as model output.

Example 7.2.1 Birthweights cont.

        Recall the simple linear regression example on birth weights,

        𝔼[Yi]=β1+β2xi

        where xi is gestational age at birth. We obtained β^=(β^1,β^2)=(-1485,116).

        Can you predict the birth weight of a child at 37.5 weeks?

        y^=β^1+β^2×37.5=-1485+116×37.5=2865 grams.
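The same prediction can be obtained in R with the predict function. A minimal sketch, assuming the bwt data frame of Example 7.1.2 (the model is refitted with a data argument so that predict can be given new covariate values):

> fit <- lm(Weight ~ Age, data = bwt)
> predict(fit, newdata = data.frame(Age = 37.5))   # expected birth weight at 37.5 weeks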

        7.2.1 Estimation of residual variance σ2

        From the definition of the linear regression model there is one other parameter to be estimated: the residual variance σ2. We estimate this using the variance of the estimated residuals.

        The estimated residuals are defined as,

        ϵ^i=yi-μ^i=yi-β^1xi,1--β^pxi,p, (7.5)

        and we estimate σ2 by

        σ^2=1n-pi=1nϵ^i2.

        The heuristic reason for dividing by n-p, rather than n, is that although the sum is over n residuals these are not independent since each is a function of the p parameter estimates β^1,,β^p. Dividing by n-p then gives an unbiased estimate of the residual variance. This is the same reason that we divide by n-1, rather than n, to get the sample variance. The square root of the residual variance, σ, is referred to as the residual standard error.

Example 7.2.2 Birth weights cont.

Returning to the simple linear regression on birth weight, to calculate the residuals we subtract the fitted birth weights from the observed birth weights. The fitted birth weights are

        1. 1

          μ^1=-1485+116×40=3155,

        2. 2

          μ^2=-1485+116×38=2923,

3. ⋮

        What are the residuals?

        The estimated residuals are

        1. 1

          ϵ^1=y1-μ^1=2968-3155=-187,

        2. 2

          ϵ^2=y2-μ^2=2795-2923=-128,

3. ⋮

        What is the estimate of the residual variance?

        Since n=24 and p=2,

        σ^2=1n-pi=1nϵ^i2=124-2[(-187)2+(-128)2+]=37455.09.

        The estimated residuals can also be obtained from the lm fit in R,

        > bwtlm$residuals

        So we can calculate the residual variance as

        > sum(bwtlm$residuals^2)/22

        Why is this estimate slightly different to the one obtained previously?

        We used rounded values of β^ to calculate the estimates. In fact, when we look at the residual standard error (σ^), the error made by using rounded estimates is much smaller.

        Finally, lm also gives the residual standard error directly, via the summary function,

        > summary(bwtlm)
        Call:
        lm(formula = bwt$Weight ~ bwt$Age)
        Residuals:
        Min      1Q  Median      3Q     Max
        -262.03 -158.29    8.35   88.15  366.50
        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept)  -1485.0      852.6  -1.742   0.0955 .
        bwt$Age        115.5       22.1   5.228 3.04e-05 ***
        ---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        Residual standard error: 192.6 on 22 degrees of freedom
        Multiple R-squared: 0.554,      Adjusted R-squared: 0.5338
        F-statistic: 27.33 on 1 and 22 DF,  p-value: 3.04e-05
        Remark.

        summary is a very useful command, for example it allows you to view the parameter estimates of a fitted model. We will use it more later in the course.

        7.3 Summary

        • 1

          The regression coefficients β1,,βp are estimated by minimising the sum of squares function

          S(β)=i=1n(Yi-β1xi,1--βpxi,p)2=(Y-Xβ)(Y-Xβ).
        • 2

          The sum of squares function is a function of p variables (the regression coefficients). To minimise this function we must calculate p first-order partial derivatives, set each of these equal to zero and solve the resulting set of p simultaneous equations.

        • 3

          The least squares estimator is given by

          β^=(XX)-1XY.
        • 4

          To estimate the residual variance, the least squares function is evaluated at β^ and then divided by n-p

          σ^2=1n-pi=1n(yi-β^1xi,1--β^pxi,p)2.
        • 5

          The predicted values from a linear regression model are defined as

          μ^i=β^1xi,1++β^pxi,p.
        • 6

          The predicted value is the expected (or mean) value of the response for that particular combination of explanatory variables.

        Chapter 8 Sampling distribution of estimators

        So far, we have focused on estimation and interpretation of the regression coefficients. In practice, it is never sufficient just to report parameter estimates, without also reporting either a standard error or confidence interval. These measures of uncertainty can also be used to decide whether or not the relationships represented by the regression models are significant.

        As for the estimators discussed in Part 1, β^ and σ^2 are random variables, since they are both functions of the response vector Y. Consequently, they each have a sampling distribution. This is our starting point in deriving standard errors and confidence intervals.

        8.1 Regression coefficients

        We can write the regression coefficients as

        β^=AY

        where A=(XX)-1X.

        Since A is considered to be fixed, β^ is a linear combination of the random variables Y1,,Yn. By the definition of the linear model Y1,,Yn are normal random variables, and so any linear combination of Y1,,Yn is also a normal random variable (by the linearity property of the normal distribution, see Math230).

        8.1.1 Expectation of least squares estimator

        Find the expectation of β^ in terms of A, X and β.

        𝔼[β^]=𝔼[AY]=A𝔼[Y]

        by linearity of expectation, so 𝔼[β^]=AXβ by definition of linear model.

        Now

        AX=(XX)-1XX=Ip.

where Ip is the p×p identity matrix. Consequently,

        𝔼[β^]=AXβ=Ipβ=β,

        so that the estimator is unbiased.

        8.1.2 Variance of least squares estimator

        To find the variance,

        Var(β^)=Var(AY)=AVar(Y)A

        by properties of the variance seen in Math230. By definition of linear model,

        AVar(Y)A=Aσ2InA=σ2AA.

        Now

        AA =(XX)-1X[(XX)-1X]
        =(XX)-1XX(XX)-1
        =(XX)-1Ip
        =(XX)-1.

Consequently,

        Var(β^)=σ2AA=σ2(XX)-1.

To summarise, the sampling distribution for the estimator of the regression coefficient is

        β^MVNp(β,σ2(XX)-1).
        Remark.

        We will see in Section 8.4 how to use this result to carry out hypothesis tests on β.

        Remark.

        In practice, the residual variance σ2 is usually unknown and must be replaced by its estimate σ^2.
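As a sketch of how this result is used in practice, the standard errors for the birth weight fit of Example 7.1.2 can be reproduced from σ^2(XX)-1 (this assumes the bwt data frame and the fitted object bwtlm from that example):

> X <- cbind(1, bwt$Age)                          # design matrix: intercept and gestational age
> sigma2hat <- sum(bwtlm$residuals^2)/(24 - 2)    # estimate of the residual variance (n = 24, p = 2)
> sqrt(diag(sigma2hat * solve(t(X) %*% X)))       # standard errors of beta-hat; compare with summary(bwtlm)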

        8.2 Linear combinations of regression coefficients

        Recall model 6.2 for birth weight in Example 6.4.5. This model includes separate intercepts for males (β1) and females (β2). We might be interested in the difference between male and female birth weights, β1-β2, estimated by β^1-β^2. In particular, we might be interested in testing whether or not there is a difference between β1 and β2,

        H0:β1-β2=0

        vs.

        H1:β1-β20.

        What is an appropriate test statistic for this test? What sampling distribution should we use to obtain the critical region, p-value or confidence interval? Since β^1-β^2 is a linear combination of the regression coefficients, we can find its distribution, and hence a test statistic for this test.

        In general for the linear combination

        aβ^=a1β^1++apβ^p,

        then

        𝔼[aβ^] =a𝔼[β^]
        =aβ.

        and

        Var(aβ^) =aVar(β^)a
        =aσ2(XX)-1a
        =σ2a(XX)-1a.

Further, because β^ follows a multivariate normal distribution, aβ^ also follows a normal distribution,

        aβ^N(aβ,σ2a(XX)-1a).

        In practice, the unknown residual variance σ2 is replaced with the estimate σ^2.

Example 8.2.1 Birth weights cont.

        Recall that model 6.2 for birthweight is

        𝔼[Yi]=β1xi,1+β2xi,2+β3xi,3

        where xi,1 and xi,2 are indicators for male and female respectively, and xi,3 is gestational age. Using the data in Table 7.1, we have

X = [ 1  0  40
      1  0  38
      1  0  40
      ⋮  ⋮   ⋮
      1  0  38
      0  1  40
      ⋮  ⋮   ⋮
      0  1  40 ],

(XX)-1 = [ 19.7    19.8   -0.512
           19.8    20.1   -0.517
          -0.512  -0.517   0.0133 ],
        σ^2=31370.

        What are the expectation and variance of β^1-β^2?

        First write β^1-β^2=a(β^1,β^2,β^3) for some a,

        β^1-β^2=1.β^1+(-1)β^2+(0)β^3=(1,-1,0)β^.

        So a=(1,-1,0). Consequently,

        𝔼[β^1-β^2] =(1,-1,0)β
        =β1-β2

        and

Var(β^1-β^2) = 31370 × (1, -1, 0) [ 19.7    19.8   -0.512  ] [  1 ]
                                  [ 19.8    20.1   -0.517  ] [ -1 ]
                                  [-0.512  -0.517   0.0133 ] [  0 ]
        =31370×0.169
        =5301.

        8.3 Residual error

The sampling distribution of the estimator σ^2 of the residual variance is based on the χn-p2 distribution: specifically, (n-p)σ^2/σ2 follows a χn-p2 distribution. We do not give a formal proof of this here, but the intuition is that σ^2 is proportional to a sum of squares of Normal random variables (the estimated residuals), and hence is related to a χ2 distribution. The degrees of freedom n-p come from the fact that the estimated residuals are not independent (each is a function of the estimated regression coefficients β^1,…,β^p). Additionally, in the same way that the sample mean and sample variance are independent, so too are the estimators of the regression coefficients β^ and the residual variance σ^2. Although we do not prove this result, it is used below to justify a hypothesis test for the regression coefficient βj.

        8.4 Hypothesis tests for the regression coefficients

The question that is typically asked of a regression model is ‘Is there evidence of a significant relationship between an explanatory variable and a response variable?’. For example, ‘Is there evidence that domestic gas consumption increases as outside temperatures decrease?’

        An equivalent way to ask this is ‘Is there evidence that the regression coefficient βj associated with the explanatory variable xj of interest is significantly different to zero?’ This can be answered by testing

        H0:βj=0 vs. H1:βj0. (8.1)

        More generally we can test

        H0:βj=b vs. H1:βjb. (8.2)

        In analogy with the tests in Part 1, the test statistic required for the hypothesis test (8.2) is

t = (β^j - b)/√(σ^2(XX)j,j-1) (8.3)

        where (XX)j,j-1 is the j-th diagonal element of (XX)-1. Since

        • 1

          β^j follows a Normal distribution;

        • 2

(n-p)σ^2/σ2 follows a χn-p2 distribution;

        • 3

β^j is independent of σ^2,

        the test statistic follows a t-distribution with n-p degrees of freedom.

Note that the standard error of β^j is √(σ^2(XX)j,j-1).

        Linear combinations of the regression coefficients

        A similar approach can be taken for linear combinations of regression coefficients. From Section 8.2, we know that aβ^ also has a normal distribution, with mean aβ and variance σ2a(XX)-1a. To test

        H0:aβ=b

        vs.

        H1:aβb,

        we use a similar argument as above, comparing the test statistic

t = (aβ^ - b)/√(Var(aβ^))

        to the tn-p distribution.

        The variance of aβ^ can be calculated using the expression σ2a(XX)-1a.
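A generic sketch of this test in R, with hypothetical inputs: X is the n×p design matrix, betahat the vector of estimates, sigma2hat the estimated residual variance, a the vector defining the linear combination and b its value under H0:

> se_a <- sqrt(drop(sigma2hat * t(a) %*% solve(t(X) %*% X) %*% a))   # standard error of a'beta-hat
> tstat <- (sum(a * betahat) - b)/se_a                               # compare with the t distribution on n-p degrees of freedom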

Example 8.4.1 Birth weights cont.

        Recall the simple linear regression relating birth weight to gestational age at birth,

        𝔼[Yi]=β1+β2xi.

        We want to test whether gestational age has a significant positive effect on birth weight, that is, H0:β2=0 vs. H1:β2>0.

        First, calculate (XX)-1. From Example 7.1.1 this is,

        (XX)-1=[19.6-0.507-0.5070.0132].

        Now calculate the test statistic,

t = (β^2 - b)/√(σ^2(XX)2,2-1) = (116 - 0)/√(37455 × 0.0132) = 116/22.2 = 5.22.

Compare t=5.22 to the t22 distribution. From R, t22(0.95) is 1.72. Since 5.22>1.72, we conclude that there is evidence to reject H0 at the 5% level and say that gestational age at birth has a positive effect on birth weight.
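The critical value quoted above can be obtained from R's qt function, and pt gives the corresponding one-sided p-value (a minimal sketch, using the test statistic t = 5.22 on 22 degrees of freedom):

> qt(0.95, df = 22)        # one-sided 5% critical value, approximately 1.72
> 1 - pt(5.22, df = 22)    # one-sided p-value for the observed test statistic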

        We can use R to help us with the test. Consider again the output of the summary function

        > summary(bwtlm)
        Call:
        lm(formula = bwt$Weight ~ bwt$Age)
        Residuals:
        Min      1Q  Median      3Q     Max
        -262.03 -158.29    8.35   88.15  366.50
        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept)  -1485.0      852.6  -1.742   0.0955 .
        bwt$Age        115.5       22.1   5.228 3.04e-05 ***
        ---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        Residual standard error: 192.6 on 22 degrees of freedom
        Multiple R-squared: 0.554,      Adjusted R-squared: 0.5338
        F-statistic: 27.33 on 1 and 22 DF,  p-value: 3.04e-05

        Note that we can obtain the standard error of β^2 directly from the Coefficients table. In fact, we can obtain the p-value for the required test, p=0.0000304/2=0.0000152<0.05. This is clearly significant at the 5% level. However, if we want to test βj=b where b0, we must still calculate the test statistic by hand. For examination purposes, you will be expected to be able to calculate the test statistic using expression (8.3), so do not become too reliant on R.

Example 8.4.2 Gas consumption cont.

        Recall from equation (6.6) the model relating gas consumption to outside temperature and whether or not cavity wall insulation has been installed. We fitted this model in Example 7.1.3.

        1. 1

          Before cavity wall insulation was installed, was there a significant relationship between outside temperature and gas consumption?

        2. 2

          After cavity wall insulation was installed, is there a significant relationship between outside temperature and gas consumption?

        To answer question 1, we test

        H0:β2=0

        vs.

        H1:β20.

        To do this, first calculate the estimated residual variance σ^2=0.0728 by calculating the fitted values, then the residuals and finally using expression (7.5).

        Since we are testing β2, we need (XX)2,2-1. In R,

        > X <- matrix(cbind(rep(1,44),gas$Temp,gas$Insulate2,
        gas$Insulate2*gas$Temp),ncol=4)
        > solve(t(X)%*%X)[2,2]
        [1] 0.004846722

        Then the test statistic is

t = (β^2 - 0)/√(σ^2(XX)2,2-1) = -0.393/√(0.0728 × 0.00485) = -20.9

Since n=44 and p=4, we compare |t| to t40(0.975)=2.021. Clearly |-20.9|=20.9>2.021, so we conclude that there is evidence to reject the null hypothesis at the 5% level, i.e. there is evidence of a relationship between outside temperature and gas consumption before insulation.

        We have seen that the relationship between gas consumption and outside temperature after insulation is given by β2+β4. So, to answer question 2, we need to test

        H0:β2+β4=0

        vs.

        H1:β2+β40.

        First, calculate the variance of β^2+β^4. From Math230,

        Var(β^2+β^4) =Cov(β^2+β^4,β^2+β^4)
        =Var(β^2)+2Cov(β^2,β^4)+Var(β^4)
        =σ2(XX)2,2-1+2σ2(XX)2,4-1+σ2(XX)4,4-1
        =σ2[(XX)2,2-1+2(XX)2,4-1+(XX)4,4-1].

        Next, obtain the required elements from (XX)-1,

        > X <- matrix(cbind(rep(1,44),gas$Temp,gas$Insulate2,
        gas$Insulate2*gas$Temp),ncol=4)
        > solve(t(X)%*%X)[2,2]
        [1] 0.004846722
        > solve(t(X)%*%X)[2,4]
        [1] -0.004846722
        > solve(t(X)%*%X)[4,4]
        [1] 0.02723924

        and calculate the test statistic,

t = (β^2 + β^4)/√(Var(β^2 + β^4))
= (-0.393 + 0.144)/√(0.0728 × (0.00485 + 2 × (-0.00485) + 0.0272))
= -0.250/√(0.0728 × 0.0224)
= -6.18.

        Finally, compare t=|-6.18|=6.18 to t40(0.975)=2.021. Since 6.18>2.021 we conclude that, at the 5% level, there is evidence of a relationship between gas consumption and outside temperature, once insulation has been installed. So, the insulation has not entirely isolated the house from the effects of external temperature, but it does appear to have weakened this relationship.

        8.5 Confidence intervals for the regression coefficients

We can also use the sampling distributions of β^j and σ^2 to create a 100(1-α)% confidence interval for βj,

β^j ± tn-p(1-α/2) × √(σ^2(XX)j,j-1)

As discussed in Part 1, the confidence interval can be used to test H0:βj=b against H1:βjb. The null hypothesis is rejected at the 100α% significance level if b does not lie in the 100(1-α)% confidence interval.
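In R, two-sided confidence intervals of this form are available directly from the confint function; a minimal sketch using the birth weight fit bwtlm of Example 7.1.2:

> confint(bwtlm, level = 0.95)    # 95% confidence intervals for the intercept and the age coefficient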

        To test against the one-tailed alternatives,

        • 1

H1:βj>b. Calculate the 100(1-2α)% confidence interval and reject H0 at the 100α% level if b lies below the lower bound of the confidence interval;

        • 2

H1:βj<b. Calculate the 100(1-2α)% confidence interval and reject H0 at the 100α% level if b lies above the upper bound of the confidence interval.

Example 8.5.1 Birth weights: confidence interval and two-tailed test

Derive a 95% confidence interval for the regression coefficient representing the relationship between weight and gestational age at birth.

        We have all the information to do this from the previous example,

        1. 1

β^2=116, se(β^2)=√(σ^2(XX)2,2-1)=22.2 and t22(0.975)=2.074.

        2. 2

          Then the 95% confidence interval for β2 is

β^2 ± t22(0.975) × √(σ^2(XX)2,2-1) = 116 ± 2.074 × 22.2 = (70.0, 162.0).

        Since zero lies outside this interval, there is evidence at the 5% level to reject H0, i.e. there is evidence of a relationship between gestational age and weight at birth.

Example 8.5.2 Birth weights: confidence interval and one-tailed test

        Since zero lies below the confidence interval, we might want to test

        H0:β2=0

        vs.

        H1:β2>0.

        To test at the 5% level, use t22(0.95)=1.717 to calculate a 90% confidence interval and see if zero lies to the left of this confidence interval.

        As above,

β^2 ± t22(0.95) × √(σ^2(XX)2,2-1) = 116 ± 1.717 × 22.2 = (77.9, 154.1).

        Since 0<77.9, we conclude that there is evidence for a positive relationship between gestational age and weight at birth.

        8.6 Summary

        • 1

          The least squares estimator β^=(XX)-1XY is a random variable, since it is a function of the random variable Y.

        • 2

          To obtain a sampling distribution for β^ we write

          β^=AY

          where A=(XX)-1X and so β^ is a linear combination of n normal random variables Y1,,Yn.

        • 3

          From this it follows that the sampling distribution is

          β^Normal(β,σ2(XX)-1).
        • 4

          We can use this distribution to calculate confidence intervals or conduct hypothesis tests for the regression coefficients.

        • 5

          The most frequent hypothesis test is to see whether or not a covariate has a ‘significant effect’ on the response variable. We can test this by testing

          H0:βj=0

          vs.

          H1:βj0.

          A one-sided alternative can be used if there is some prior belief about whether the relationship should be positive or negative.

        • 6

          In a similar way, a sampling distribution can be derived for linear combinations of the regression coefficients aβ^:

          aβ^Normal(aβ,σ2a(XX)-1a).

        Chapter 9 Explanatory variables: some interesting issues

        We have introduced the linear model, and discussed some properties of its estimators. Over the remaining chapters we discuss further modelling issues that can arise when fitting a linear regression model. These include

        • 1

          Collinearity and interactions between explanatory variables

        • 2

          Covariate selection (sometimes referred to as model selection)

        • 3

          Prediction

        • 4

          Diagnostics (assessing goodness of model fit to the data)

        9.1 Collinearity

        Collinearity arises when there is linear dependence (strong correlation) between two or more explanatory variables. We say that two explanatory variables xi and xj are

        • 1

          Orthogonal if corr(xi,xj) is close to zero;

        • 2

Collinear if corr(xi,xj) is close to ±1.

        Collinearity is undesirable because it means that the matrix XX is ill conditioned, and inversion of (XX) is numerically unstable. It can also make results difficult to interpret.

Example 9.1.1 Cereal prices

        In Example 6.4.4, we related annual maize prices, Yi, to annual prices of barley, xi,1, and wheat, xi,2. Consider the three models:

        1. 1

          𝔼[Yi]=β1+β2xi,1,

        2. 2

          𝔼[Yi]=β1+β2xi,2,

        3. 3

          𝔼[Yi]=β1+β2xi,1+β3xi,2.

        We fit the models using R. First load the data set cereal, and see what it contains,

        > names(cereal)
        [1] "Year"   "Barley" "Cotton" "Maize"  "Rice"   "Wheat"

        To fit the three models,

        > lm1 <- lm(cereal$Maize~cereal$Barley)
        > lm2 <- lm(cereal$Maize~cereal$Wheat)
        > lm3 <- lm(cereal$Maize~cereal$Barley+cereal$Wheat)

        Now let’s look at the estimated coefficients in each model.

        > lm1$coefficients
        (Intercept) cereal$Barley
        -9.484660      1.085748
        > lm2$coefficients
        (Intercept) cereal$Wheat
        -30.8254882    0.9491281
        > lm3$coefficients
        (Intercept) cereal$Barley  cereal$Wheat
        -25.6646279    -0.5095537     1.3207563

In the two-covariate model, the coefficients for Barley and Wheat are both considerably different from the equivalent estimates obtained in the two one-covariate models. In particular, the relationship with Barley is positive in model 1 but negative in model 3. What is going on here?

        To investigate, we check which of the covariates has a significant relationship with maize prices in each of the models. We will use the confidence interval method. For this we need the standard errors of the regression coefficients, which can be found by hand or using the summary function, e.g.

        > summary(lm1)
        Call:
        lm(formula = cereal$Maize ~ cereal$Barley)
        Residuals:
        Min       1Q   Median       3Q      Max
        -106.401  -21.731   -5.482   21.282   89.921
        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept)    -9.4847    32.2742  -0.294    0.772
        cereal$Barley   1.0857     0.1919   5.657 1.88e-05 ***
        ---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        Residual standard error: 45.22 on 19 degrees of freedom
        Multiple R-squared: 0.6274,     Adjusted R-squared: 0.6078
        F-statistic:    32 on 1 and 19 DF,  p-value: 1.875e-05

        so that the standard error for β2 in model 1 is 0.1919.

        For model 1, the 95% confidence interval for β2 (barley) is

        β^2±t21-2(0.975)×se(β^2)=1.09±2.093×0.1919=(0.684,1.487).

        For model 2, the 95% confidence interval for β2 (wheat) is

        0.949±2.093×0.1109=(0.717,1.18).

        For model 3, the 95% confidence interval for β2 (barley) is

        β^2±t21-3(0.975)×se(β^2)=-0.510±2.101×0.4076=(-1.366,0.347).

        and for β3 (wheat) it is

        1.321±2.101×0.3167=(0.655,1.99).

        We can conclude that

        • 1

          If barley alone is included, then it has a significant relationship with maize price (at the 5% level).

        • 2

          If wheat alone is included, then it has a significant relationship with maize price (at the 5% level).

        • 3

          If both barley and wheat are included, then the relationship with barley is no longer significant (at the 5% level).

        Why is this?

        The answer comes if we look at the relationship between barley and wheat prices, see Figure 9.1. The sample correlation between these variables is 0.939, indicating a very strong linear relationship. Since their behaviour is so closely related, we do not need both as covariates in the model.
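This correlation can be checked directly in R from the cereal data frame (a one-line sketch; the text above reports the value as 0.939):

> cor(cereal$Barley, cereal$Wheat)    # sample correlation between barley and wheat prices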

If we do include both, then the model cannot accurately identify the individual relationships. We should use either model 1 or model 2. However, since these two models are not nested, there is no straightforward statistical test to compare them; one possibility is to select the one with the smallest p-value associated with β2, i.e. the one with the strongest relationship between the covariate and the response.

        Fig. 9.1: Forecasts of annual prices for wheat against barley. Prices are in dollars per tonne.

        9.2 Interactions

        We touched on the concept of an interaction in the gas consumption model given in Example 6.4.6. In this example, the relationship between gas consumption and outside temperature was altered by the installation of cavity wall insulation. This is an interaction between a factor (insulated or not) and a covariate (temperature).

        Suppose that the response variable is Yi and there are two explanatory variables xi,1 and xi,2. We could either

        1. 1

          Model the main effects only,

          𝔼[Yi]=β1+β2xi,1+β3xi,2
        2. 2

          Or include an interaction as well

          𝔼[Yi]=β1+β2xi,1+β3xi,2+β4xi,1xi,2

        Note the interaction term is sometimes written as xi,1×xi,2.
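In R model formulas the two specifications correspond to + and * (a sketch with a hypothetical data frame dat containing columns y, x1 and x2):

> lm(y ~ x1 + x2, data = dat)    # main effects only
> lm(y ~ x1 * x2, data = dat)    # main effects plus interaction, equivalent to y ~ x1 + x2 + x1:x2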

        9.2.1 Interaction between two factors

        We illustrate the idea of an interaction using a thought experiment.

        A clinical trial is to be carried out to investigate the effect of the doses of two drugs A and B on a medical condition. Both drugs are available at two dose levels. All four combinations of drug-dose levels will be investigated.

N patients are randomly assigned across the four possible combinations of drug-dose levels, so that N/4 patients receive each combination. The response variable is the increase from pre- to post-treatment red blood cell count. The average increase is calculated for each drug-dose level combination.

Consider three possible outcomes of this experiment. In all three outcomes, the level of both drugs affects cell count.

        1. 1

          In outcome 1, cell count increases with dose level of both A and B. Since the size and direction of the effect of the dose level of drug A on the cell count is unchanged by changing the dose level of drug B there is no interaction.

        2. 2

          In outcome 2, there is an interaction; at level 1 of drug B, the cell count is lower for drug A level 2, than for drug A level 1. Conversely, at level 2 of drug B, the cell count is lower for drug A level 1, than for drug A level 2. The direction of the effect of the dose levels of drug A is altered by changing the dose of drug B.

        3. 3

          In outcome 3, there is also an interaction. In this case increasing dose level of drug A increases cell count, regardless of the level of drug B. But the difference in the response for levels 1 and 2 of drug A is much greater for level 1 of drug B than it is for level 2 of drug B. The size of the effect of the dose levels of drug A is altered by changing the dose of drug B.

        9.2.2 Interaction between a factor and a covariate

Continuing with the gas consumption example, the relationship of interest is between outside temperature and gas consumption. We saw in Figure 6.5 that the size of this relationship depends on whether or not the house has cavity wall insulation, and we wrote this model formally as

        𝔼[Yi]=β1+β2xi,1+β3xi,2+β4xi,1xi,2

        where

        • 1

          The coefficient β2 is the size of the main effect of outside temperature on gas consumption

        • 2

          The coefficient β4 is the size of the interaction between the effect of outside temperature and whether or not insulation is installed.

        To test whether or not there is an interaction, i.e. whether or not installing insulation has a significant effect on the relationship between outside temperature and gas consumption, we can test

        H0:β4=0

        vs.

        H1:β40.

        We have previously fitted this model in R,

        gaslm <- lm(gas$Gas~gas$Temp*gas$Insulate2)

        We will use the output from this model to speed up our testing procedure,

        > summary(gaslm)
        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept)             6.85383    0.11362  60.320  < 2e-16 ***
        gas$Temp               -0.39324    0.01879 -20.925  < 2e-16 ***
        gas$Insulate2          -2.26321    0.17278 -13.099 4.71e-16 ***
        gas$Temp:gas$Insulate2  0.14361    0.04455   3.224  0.00252 **
        ---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
        Residual standard error: 0.2699 on 40 degrees of freedom
        Multiple R-squared: 0.9359,     Adjusted R-squared: 0.9311
        F-statistic: 194.8 on 3 and 40 DF,  p-value: < 2.2e-16

        Find the standard error for β^4

        Reading from the second column in the Coefficients table, this is 0.04455.

        Calculate the test statistic

t = (β^4 - 0)/se(β^4) = 0.144/0.04455 = 3.22

The value of this test statistic also appears in the output above (where?). What is the critical value?

        Compare to t40(0.975)=2.021.

        What do we conclude?

        Since 3.22 > 2.021 there is evidence at the 5% level to reject H0, i.e. there was a significant change in the relationship between outside temperature and gas consumption following insulation.
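The critical value and p-value for this test can also be obtained directly in R; a minimal sketch (the t value 3.224 is taken from the summary output above):

> qt(0.975, df = 40)            # critical value, approximately 2.021
> 2*(1 - pt(3.224, df = 40))    # two-sided p-value, matching Pr(>|t|) in the summary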

        9.3 Summary

        {mdframed}
        • 1

          Collinearity occurs when two explanatory variables are highly correlated.

        • 2

          Collinearity makes it hard (sometimes impossible) to disentangle the separate effects of the collinear variables on the response.

        • 3

          An interaction occurs when altering the value of one explanatory variable changes the effect of a second explanatory variable on the response.

        • 4

          This change could be a change in the size of the effect, in the direction of the effect (positive or negative), or in both of these.

        Chapter 10 Covariate selection

        Covariate selection refers to the process of deciding which of a number of explanatory variables best explain the variability in the response variable. You can think of it as finding the subset of explanatory variables which have the strongest relationships with the response variable.

We will only look at comparing nested models. Consider two models: the first has p1 explanatory variables and the second has p2>p1 explanatory variables. We refer to the model with fewer covariates as the simpler model.

        An example of a pair of nested models is when the more complicated model contains all the explanatory variables in the simpler model, and an additional p2-p1 explanatory variables.

        For example, given a response Yi and explanatory variables xi,1, xi,2 and xi,3, we could create three possible models;

        1. A

          𝔼[Yi]=β0+β1xi,1,

        2. B

          𝔼[Yi]=β0+β1xi,1+β2xi,2,

        3. C

          𝔼[Yi]=β0+β1xi,1+β2xi,2+β3xi,3.

        Which model(s) are nested inside model C?

        Models A and B are nested inside model C.

        Are either of models A or C nested inside model B?

        Model A is, since model B is model A with an additional covariate.

        Neither model B nor model C is nested in model A.

        Write down another model that is nested in model C.

        𝔼[Yi]=β0+β1xi,2.

        Definition (Nesting).

        Define model 1 as 𝔼[Y]=Xβ and model 2 as 𝔼[Y]=Aγ, where X is an n×p1 matrix and A is an n×p2 matrix, with p1<p2. Assume X and A are both of full rank, i.e. neither has linearly dependent columns.

Then model 1 is nested in model 2 if the column space of X is a (strict) subspace of the column space of A.

        Given a pair of nested models, we will focus on deciding whether there is enough evidence in the data in favour of the more complicated model; or whether we are justified in staying with the simpler model.

        The null hypothesis in this test is always that the simpler model is the best fit.

        We start with an example.

        TheoremExample 10.0.1 Brain weights

In Section 6.2, Example 6.2.3 considered whether the body weight of a mammal could be used to predict its brain weight. In addition, we have the average number of hours of sleep per day for each species in the study.

        Let Yi denote brain weight, xi,1 denote body weight and xi,2 denote number of hours asleep per day. Here i denotes species. We will model the log of both brain and body weight.

        Which of the following models fits the data best?

        1. 1

          𝔼[logYi]=β1+β2logxi,1,

        2. 2

          𝔼[logYi]=β1+β2xi,2,

        3. 3

          𝔼[logYi]=β1+β2logxi,1+β3xi,2.

        There are four species for which sleep time is unknown. For a fair comparison between models, we remove these species from the following study completely, leaving n=58 observations.

        We can fit each of the models in R as follows,

> L1 <- lm(log(sleep$BrainWt)~log(sleep$BodyWt))
> L2 <- lm(log(sleep$BrainWt)~sleep$TotalSleep)
> L3 <- lm(log(sleep$BrainWt)~log(sleep$BodyWt)+sleep$TotalSleep)

        Figure 10.1 shows the fitted relationships in models L1 and L2.

Fig. 10.1: Left: log brain weight (g) against log body weight. Right: log brain weight (g) against sleep per day (hours). Data for 58 species of mammals.

        Which of these models are nested?

        Models L1 and L2 are both nested in model L3.

        Using the summary function, we can obtain parameter estimates, and their standard errors, e.g.

        > summary(L1)

        The fitted models are summarised in Table 10.1.

        Model β1 β2 β3
        L1 2.15 (0.0991) 0.759 (0.0303) NA
        L2 6.17 (0.675) -0.299 (0.0588) NA
        L3 2.60 (0.288) 0.728 (0.0352) -0.0386 (0.0237)
        Table 10.1: Parameter estimates, with standard errors in brackets for each of three possible models for the mammal brain weight data.

        For each model, we can test to see which of the explanatory variables is significant.

For model L1, we test H0:β2=0 vs. H1:β2≠0 by calculating

t = β^2/se(β^2) = 0.759/0.0303 = 25.09.

Comparing this to t56(0.975)=2.00, we see that β2 is significantly different from zero at the 5% level. We conclude that there is evidence of a significant relationship between (log) brain weight and (log) body weight.

For model L2, to test H0:β2=0 vs. H1:β2≠0, calculate

t = β^2/se(β^2) = -0.299/0.0588 = -5.092.

Again, the critical value is t56(0.975)=2.00. Since |-5.092|>2.00 we conclude that there is evidence of a relationship between hours of sleep per day and (log) brain weight. This is a negative relationship: the more hours of sleep per day, the lighter the brain. Of course, we cannot claim that this is a causal relationship!

For model L3, we first test H0:β2=0 vs. H1:β2≠0, using

t = β^2/se(β^2) = 0.728/0.0352 = 20.67.

Next we test H0:β3=0 vs. H1:β3≠0, using

t = β^3/se(β^3) = -0.0386/0.0237 = -1.632.

        In both cases the critical value is t55(0.975)=2.00; so, at the 5% level, there is evidence of a relationship between (log) brain weight and (log) body weight, but there is no evidence of a relationship between (log) brain weight and hours of sleep per day.
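These critical values can be checked in R; note also that the summary output for each fitted model reports the same t values together with their p-values.

> qt(0.975, df = 56)   # critical value for models L1 and L2, approximately 2.00
> qt(0.975, df = 55)   # critical value for model L3, approximately 2.00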

        To summarise, individually, both explanatory variables appear to be significant. However, when we include both in the model, only one is significant. This appears to be a contradiction. So which is the best model to explain variability amongst brain weights in mammals?

        In general, we want to select the simplest possible model that explains the most variation.

        {mdframed}

Including additional explanatory variables will always increase the amount of variability explained, but is the increase sufficient to justify the additional parameter that must then be estimated?

        10.1 The F test

        The F-test gives a formal statistical test to choose between two nested models. It is based on a comparison between the sum of squares for each of the two models.

        Suppose that model 1 has p1 explanatory variables, model 2 has p2>p1 explanatory variables and model 1 is nested in model 2. Let model 1 have design matrix X and parameters β; model 2 has design matrix A and parameters γ.

        First we show formally that adding additional explanatory variables will always improve model fit, by decreasing the residual sum of squares for the fitted model.

Let SS1 = (y-Xβ^)ᵀ(y-Xβ^) and SS2 = (y-Aγ^)ᵀ(y-Aγ^) be the residual sums of squares for models 1 and 2 respectively. Then

SS2 ≤ SS1.

        Why does this last inequality hold?

        Because of the nesting, we can always find a value γ~ such that

        Xβ^=Aγ~.

        Recalling the definition of the sum of squares,

SS2 = (y-Aγ^)ᵀ(y-Aγ^) ≤ (y-Aγ~)ᵀ(y-Aγ~)

by definition of the least squares estimator. So, by definition of γ~,

SS2 ≤ (y-Xβ^)ᵀ(y-Xβ^) = SS1.

        To carry out the F-test we must decide whether the difference between SS1 and SS2 is sufficiently large to merit the inclusion of the additional explanatory variables in model 2.

        Consider the following hypothesis test

        H0:Model 1 is the best fit

        vs.

        H1:Model 2 is the best fit.
        Remark.

We do not say that ‘Model 1 is the true model’ or ‘Model 2 is the true model’. All models, be they probabilistic or deterministic, are a simplification of real life. No model can exactly describe a real life process. But some models can describe the truth ‘better’ than others. George Box (1919-2013), British statistician: ‘essentially, all models are wrong, but some are useful’.

        To test H0 against H1, first calculate the test statistic

        {mdframed}
F = [(SS1-SS2)/(p2-p1)] / [SS2/(n-p2)]. (10.1)

        Now compare the test statistic to the Fp2-p1,n-p2 distribution, and reject H0 if the test statistic exceeds the critical value (equivalently if the p-value is too small).

        The critical value from the Fp2-p1,n-p2 distribution can either be evaluated in R, or obtained from statistical tables.
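For example, a short sketch of how the critical value might be obtained in R, here for a test at the 5% level with illustrative degrees of freedom p2-p1=1 and n-p2=55:

> qf(0.95, df1 = 1, df2 = 55)   # 5% critical value of the F(1,55) distribution, approximately 4.02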

        TheoremExample 10.1.1 Brain weights cont.

        We proposed three models for log (brain weight) with the following explanatory variables:

        1. 1

          log(body weight)

        2. 2

          hours sleep per day

        3. 3

          log(body weight)+hours sleep per day

        Which of these models can we use the F-test to decide between?

        The F-test does not allow us to choose between models L1 and L2, since these are not nested. However, it does give us a way to choose between either the pair L1 and L3, or the pair L2 and L3.

        To choose between L1 and L2, we use a more ad hoc approach by looking to see which of the explanatory variables is ‘more significant’ than the other when we test

        H0:β2=0

        vs.

H1:β2≠0.

        Using summary(L1) and summary(L2), we see that the p-value for β2 in L1 is <2e-16 and for β2 in L2 is 4.30e-06. As we saw earlier, both of these indicate highly significant relationships between the response and the explanatory variable in question.

        Which of the single covariate models is preferable?

        Since the p-value for log(body weight) in model L1 is lower, our preferred single covariate model is L1.

        We can now use the F-test to choose between our preferred single covariate model L1 and the two covariate model L3,

        H0:L1 is the best fit

        vs.

        H1:L3 is the best fit.

        We first find the sum of squares for both models. For L1, using the definition of the least squares,

SS(L1) = ∑_{i=1}^{58} ϵ^i² = ∑_{i=1}^{58} (yi - β^1 - β^2 xi,1)² = ∑_{i=1}^{58} (yi - 2.15 - 0.759 xi,1)².

        To calculate this in R,

        > sum(L1$residuals^2)
        [1] 28.00023

        So SS(L1)=28.0.

        For L3,

SS(L3) = ∑_{i=1}^{58} ϵ^i² = ∑_{i=1}^{58} (yi - β^1 - β^2 xi,1 - β^3 xi,2)²
= ∑_{i=1}^{58} (yi - 2.60 - 0.728 xi,1 - (-0.0386) xi,2)².

        To calculate this in R,

        > sum(L3$residuals^2)
        [1] 26.70658

        So SS(L3)=26.7.

        Next, we find the degrees of freedom for the two models. Since n=58,

        • 1

          L1 has p1=2 regression coefficients, so the degrees of freedom are n-p1=58-2=56.

        • 2

          L3 has p2=3 regression coefficients, so the degrees of freedom are n-p2=58-3=55.

Finally we calculate the F-statistic given in equation (10.1),

F = {[SS(L1)-SS(L3)]/(p2-p1)} / {SS(L3)/(n-p2)}
= [(28.0-26.7)/(3-2)] / [26.7/(58-3)]
= 1.29/0.486
= 2.67.

        The test statistic F=2.67 is then compared to the F distribution with (p2-p1,n-p2)=(1,55) degrees of freedom. From tables, the critical value is just above 4.00; from R it is 4.02.

        What can we conclude from this?

        Since 2.67<4.00, we conclude that there is no evidence to reject H0. There is no evidence to choose the more complicated model and so the best fitting model is L1.
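The same comparison can be carried out in a single step using R's anova() function on the two nested fitted models; a sketch (the F statistic and p-value should agree with the hand calculation above, up to rounding):

> anova(L1, L3)   # F-test comparing the nested models L1 and L3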

        Remark.

        We should not be too surprised by this result, since we have already seen that the coefficient for total sleep time is not significantly different to zero in model L3. Once we have accounted for body weight, there is no extra information in total sleep time to explain any remaining variability in brainweights.

        10.1.1 Where does the F-test come from?

        From Section 7.2.1, the sum of squares, divided by the degrees of freedom, is an unbiased estimator of the residual variance,

        𝔼[SS/(n-p)]=σ2.

Equivalently,

        𝔼[SS]=(n-p)σ2.

        So if both model 1 and model 2 fit the data then both of their normalised sums of squares are unbiased estimates of σ2, and the expected difference in their sums of squares is,

        𝔼[SS1-SS2]=𝔼[SS1]-𝔼[SS2]=(n-p1)σ2-(n-p2)σ2=(p2-p1)σ2.

        and (SS1-SS2)/(p2-p1) is also an unbiased estimator of the residual variance σ2.

        But if model 1 is not a sufficiently good model for the data

        𝔼[(SS1-SS2)/(p2-p1)]>σ2

        since the expected sum of squares for model 1 will be greater than σ2 as the model does not account for enough of the variability in the response.

        It follows that the F-statistic

F = [(SS1-SS2)/(p2-p1)] / [SS2/(n-p2)]

        is simply the ratio of two estimates of σ2. If model 1 is a sufficient fit, this ratio will be close to 1, otherwise it will be greater than 1.

To see how far the F-ratio must be from 1 for the result not to have occurred by chance, we need its sampling distribution. It turns out that the appropriate distribution is the F(p2-p1),(n-p2) distribution. The proof of this is too long to cover here.

        10.2 Link to one-way ANOVA

        Recall from Chapter 5 that a one-way ANOVA is a method for comparing the group means of three or more groups; an extension of the unpaired t-test.

        It turns out that the one-way ANOVA is a special case of a simple linear model, in which the explanatory variable is a factor with three or more levels, where each level represents membership of one of the groups.

Suppose that the factor has m levels; then the linear model for a one-way ANOVA can be written as

𝔼[Yi] = β1xi,1 + β2xi,2 + ⋯ + βmxi,m

        where xi,j is the indicator variable for the j-th level of the factor.

        The purpose of an ANOVA is to test whether the mean response varies between different levels of the factor. This is equivalent to testing

H0: β1 = β2 = ⋯ = βm

vs.

H1: the βj are not all equal.

        In turn, this is equivalent to a model selection between

        • 1

          H0: Model 1, where 𝔼[Yi]=β1; and

        • 2

H1: Model 2, where 𝔼[Yi] = β1xi,1 + β2xi,2 + ⋯ + βmxi,m.

Now, since model 1 states that all responses share a common population mean, its design matrix is simply a column of 1’s and β^1 = y¯, the overall sample mean. For model 2, the design matrix has m columns, with

Xi,j = 1 if individual i is in group j, and Xi,j = 0 otherwise.

Therefore XᵀX is an m×m diagonal matrix, the diagonal entries of which correspond to the number of individuals in each of the groups,

(XᵀX)j,j = nj,

for j = 1, …, m, and Xᵀy is a vector of length m, with j-th entry being the sum of all the responses in group j. It follows that

β^j = [(XᵀX)⁻¹Xᵀy]j
= (1/nj) ∑_{i=1}^{n} yi I[individual i is in group j]
= y¯j,

        i.e. the least squares estimate of the j-th regression coefficient is the observed mean of that group.

        Calculating the sums of squares for the two models, we have

SS1 = ∑_{i=1}^{n} (yi - y¯)²

which, in ANOVA terminology, is what we referred to as the ‘total sum of squares’, and

SS2 = ∑_{i=1}^{n} (yi - y¯1 xi,1 - ⋯ - y¯m xi,m)²

        which, in ANOVA terminology, is what we referred to as the within groups sum of squares.

        Consequently, the F-ratio for model selection can be shown to be identical to the test statistic used for the one-way ANOVA:

F = [(SS1-SS2)/(m-1)] / [SS2/(n-m)]
= [(SST-SSW)/(m-1)] / [SSW/(n-m)]
= [SSB/(m-1)] / [SSW/(n-m)]
= MSB/MSW.
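A small numerical illustration of this equivalence, using simulated data rather than one of the course datasets: comparing the two nested linear models with an F-test reproduces the F statistic from the one-way ANOVA table.

> set.seed(1)
> g <- factor(rep(1:3, each = 10))        # m = 3 groups of 10 observations
> y <- rnorm(30, mean = c(5, 6, 8)[g])    # responses with different group means
> M1 <- lm(y ~ 1)                         # model 1: common mean
> M2 <- lm(y ~ g)                         # model 2: separate group means
> anova(M1, M2)                           # F-test between the nested models
> anova(M2)                               # one-way ANOVA table: the same F statistic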

        10.3 Summary

        {mdframed}
        • 1

          Covariate selection is the process of deciding which covariates explain a significant amount of the variability in the response.

        • 2

          Two nested linear regression models can be compared using an F-test. Take two models

          1. 1

Y=Xβ+ϵ1,

          2. 2

            Y=Aγ+ϵ2,

where X and A have the same number of rows, but the number of columns in X is less than that in A. Provided X and A are of full rank, the first model is nested in the second if the column space of X is a subspace of the column space of A.

        • 3

          A simple example of nesting is when the more complicated model contains all the explanatory variables in the first model, plus one or more additional ones.

        • 4

          Introducing extra explanatory variables will always reduce the residual sum of squares.

        • 5

          To compare two nested models with sums of squares SS1 (simpler model) and SS2 (more complicated model), calculate

F = [(SS1-SS2)/(p2-p1)] / [SS2/(n-p2)]

and compare this to the Fp2-p1,n-p2 distribution.

        • 6

          The one-way ANOVA can be shown to be a special case of the multiple linear regression model, in which all the explanatory variables are taken to be indicator functions denoting membership of each of the m groups.

        Chapter 11 Diagnostics

        Even if a valid statistical method, such as the F-test, has been used to select our preferred linear model, checks should still be made to ensure that this model fits the data well. After all, it could be that all the models that we tried to fit actually described the data poorly, so that we have just made the best of a bad job. If this is the case, we need to go back and think about the underlying physical processes generating the data to suggest a better model.

        Diagnostics refers to a set of tools which can be used to check how well the model describes (or fits) the data. We will use diagnostics to check that

        1. 1

          The estimated residuals ϵ^i follow a Normal(0,σ2) distribution;

        2. 2

          The estimated residuals are independent of the covariates used in the model;

3. 3

  The estimated residuals are independent of the fitted values μ^i;

4. 4

  None of the observations is an outlier (perhaps due to measurement error);

5. 5

  No observation has undue influence on the estimated model parameters.

        11.1 Normality of residuals

        One of the key underlying assumptions of the linear regression model is that the errors ϵi have a Normal distribution. In reality, we do not know the errors and these are replaced with their estimates, ϵ^i. We compare these estimated residuals to their model distribution using graphical diagnostics:

        PP (probability-probability) and QQ plots can be used to check whether or not a sample of data can be considered to be a sample from a statistical model (usually a probability distribution). In the case of the normal linear regression model, they can be used to check whether or not the estimated residuals are a sample from a Normal(0,σ2) distribution. PP plots show the same information as QQ plots, but on a different scale.

        {mdframed}

        The PP plot is most useful for checking that values around the average (the body) fit the proposed distribution. It compares the percentiles of the sample of data, predicted under the proposed model, to the percentiles obtained for a sample of the same size, predicted from the empirical distribution

        {mdframed}

        The QQ plot is most useful for checking whether the largest and smallest values (the tails) fit the proposed distribution. It compares the ordered sample of data to the quantiles obtained for a sample of the same size from the proposed model.

        First define the standardised residuals to be

r^i = (yi - μ^i)/σ^.

        From Math230, standardising by σ^ means that these should be a sample from a Normal(0,1) distribution.

        Denote by r^(i) the ordered standardised residuals, so that r^(1) is the smallest residual, and r^(n) the largest. We compare the standardised residuals to the standard normal distribution, using

        • 1

          A PP plot,

{(Φ(r^(i)), i/(n+1))}

for i = 1, …, n. Here Φ is the standard normal cumulative distribution function.

        • 2

          A QQ plot,

{(r^(i), Φ⁻¹(i/(n+1)))}

for i = 1, …, n, where Φ⁻¹ is the inverse of the standard normal cumulative distribution function.

        If the standardised residuals follow a Normal(0,1) distribution perfectly, both plots lie on the line y=x. Because of random variation, even if the model is a good fit, the points won’t lie exactly on this line.

        Remark.

        You have seen QQ plots before in Math104; in that setting, they were used to examine whether data could be considered to be a sample from a Normal distribution.

        TheoremExample 11.1.1 Brain weights cont.

        In example 10.1.1 we fitted the following linear regression model to try to explain variability in (log) brain weight (Yi) using (log) body weight (xi,1),

        𝔼[logYi]=2.15+0.759logxi,1

        We use R to create PP and QQ plots for the standardised residuals. First we will refit the model in R to obtain the required residuals,

        > L1 <- lm(log(sleep$BrainWt)~log(sleep$BodyWt))

        Next we need the residual variance,

        > sigmasq <- sum(L1$resid^2)/56

        and we can use this to get the standardised residuals:

        > stdresid <- L1$residuals/sqrt(sigmasq)

        R does not have an inbuilt function for creating a PP plot, but we can create one using the function qqplot,

        > qqplot(c(1:58)/59,pnorm(stdresid),
        xlab="Theoretical probabilities",ylab="Sample probabilities")
        > abline(a=0,b=1)

        Since we are comparing the standardised residuals to the standard Normal distribution, we can use the function qqnorm for the QQ plot,

        > qqnorm(stdresid)
        > abline(a=0,b=1)

The resulting plots are shown in Figure 11.1 and Figure 11.2. Both plots suggest that the standardised residuals do follow the standard Normal distribution closely.

        Fig. 11.1: PP plot for standardised residuals from model for log Brain weight (g). Straight lines show exact agreement between the residuals and a Normal(0,1) distribution.
        Fig. 11.2: QQ plot for standardised residuals from model for log Brain weight (g). Straight lines show exact agreement between the residuals and a Normal(0,1) distribution.
        Remark.

In general, a QQ plot is more useful than a PP plot, as it tells us about the more ‘unusual’ values (i.e. the very high and very low residuals). It is the behaviour of these values which is most likely to highlight a lack of model fit.

        Remark.

If the PP and QQ plots suggest that the residuals differ from the Normal(0,1) distribution in a systematic way, for example the points curve up (or down) and away from the 45° line at either (or both) of the tails, it may be more appropriate to

        • 1

          Transform your response, e.g. use the log or square root functions, before fitting the model; or

        • 2

          Use a different residual distribution. This is discussed in Math333 Statistical Models.

        Remark.

        A lack of normality might also be due to the residuals having non-constant variance, referred to as heteroscedasticity. This can be assessed by plotting the residuals against the explanatory variables included in the model to see whether there is evidence of variability increasing, or decreasing, with the value of the explanatory variable.

        11.2 Residuals vs. Fitted values

        A further implication of the assumptions made in defining the linear regression model is that the residuals ϵi are independent of the fitted values μ^i. This can be proved as follows:

        Recall the model assumption that

Y ∼ MVNn(Xβ, σ²I).

        Then the fitted values and estimated residuals are defined as

μ^ = Xβ^ = X(XᵀX)⁻¹XᵀY = HY (11.1)

        and

        ϵ^=Y-μ^=Y-HY, (11.2)

        where we define

H = X(XᵀX)⁻¹Xᵀ.
        Remark.

Since μ^ and ϵ^ are both functions of the random variable Y, they are themselves random variables. This means that they have sampling distributions. We focus on the joint behaviour of these two random variables.

To show that the fitted values and estimated residuals are independent, we show that the vectors μ^ and ϵ^ are orthogonal, i.e. that their inner product is zero.

        By definition of μ^ and ϵ^,

μ^ᵀϵ^ = (HY)ᵀ(Y-HY)
= YᵀHᵀ(Y-HY)
= YᵀHᵀY - YᵀHᵀHY
= YᵀHY - YᵀHY,

since HᵀH = Hᵀ = H. So μ^ᵀϵ^ = 0.

This result uses the identities HᵀH = H and Hᵀ = H. Starting from the definition of H can you prove these identities? Now you should be able to show, by combining these identities, that H is idempotent, i.e. that H = H².

Since H is idempotent, applying it to a vector in its image leaves that vector unchanged; in particular, H maps μ^ to itself. Can you show this? Mathematically, H can also be thought of as a projection. The matrix H is often referred to as the hat matrix, since it transforms the observations y into the fitted values μ^.
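These identities are easy to check numerically; a minimal sketch with simulated data (any full-rank design matrix would do):

> set.seed(1)
> x <- runif(10)
> y <- 2 + 3*x + rnorm(10)
> X <- cbind(1, x)                        # design matrix with an intercept column
> H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
> max(abs(H %*% H - H))                   # idempotent: H^2 = H, up to rounding error
> max(abs(t(H) - H))                      # symmetric: H' = H
> muhat <- H %*% y                        # fitted values
> max(abs(H %*% muhat - muhat))           # H maps the fitted values to themselves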

        A sensible diagnostic to check the model fit is to plot the residuals against the fitted values and check that these appear to be independent:

{(μ^i, ϵ^i) : i = 1, …, n}.
        TheoremExample 11.2.1 Brain weights cont.

        For the fitted brain weight regression model described in example (11.1.1), a plot of the residuals against the fitted values is shown in Figure 11.3. The code used in R to produce this plot is

        > plot(L1$fitted.values,L1$residuals,xlab="Fitted",
        ylab="Residuals")
        > R <- lm(L1$residuals~L1$fitted.values)
        > abline(a=R$coefficients[1],b=R$coefficients[2])

        The horizontal line indicates the line of best fit through the scatter plot. The correlation between the fitted values and residuals is

        > cor(L1$fitted.values,L1$residuals)
        [1] -1.691625e-17

        In this case, there is clearly no linear relationship between the residuals and fitted values and so, by this criterion, the model is a good fit.

Fig. 11.3: Residuals vs. fitted values for the brain weight model. The straight line shows the fitted linear relationship, which is negligible.

        11.3 Residuals vs. Explanatory variables

        For a well fitting model the residuals and the explanatory variables should also be independent. We can again prove this easily, by showing that the vector of estimated residuals is independent of each of the explanatory variables. In other words, each column of the design matrix X is orthogonal to the vector of estimated residuals ϵ^.

        Therefore, we need to show that

Xᵀϵ^ = 0.

Using the definition of the vector of estimated residuals in (11.2),

Xᵀϵ^ = Xᵀ(Y-HY)
= XᵀY - XᵀHY
= XᵀY - XᵀY
= 0.

The penultimate step uses the result XᵀH = Xᵀ, since, on substitution of the definition of H,

XᵀH = XᵀX(XᵀX)⁻¹Xᵀ = IXᵀ = Xᵀ.
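Again this is easy to verify numerically for any fitted model; a short sketch using simulated data:

> set.seed(2)
> x <- runif(20)
> y <- 1 + 2*x + rnorm(20)
> fit <- lm(y ~ x)
> X <- model.matrix(fit)        # the design matrix used by lm
> t(X) %*% residuals(fit)       # both entries are zero, up to rounding error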
        TheoremExample 11.3.1 Brain weights cont.

        Figure 11.4 also shows a plot of the residuals from the fitted brain weight regression model in example (11.1.1) against the explanatory variable, the log of body weight. The code to produce this plot is

        plot(log(sleep$BodyWt),L1$residuals,xlab="log(Body Weight)",
        ylab="Residuals")
        R <- lm(L1$residuals~log(sleep$BodyWt))
        abline(a=R$coefficients[1],b=R$coefficients[2])
Fig. 11.4: Residuals vs. explanatory variable for the brain weight model. The straight line shows the fitted linear relationship, which is negligible.

The horizontal line is the line of best fit through the scatter plot, again indicating no linear relationship between the explanatory variable and the residuals. This is verified by a correlation of ρ = -1.69×10⁻¹⁷.

        11.4 Outliers

        An outlier is an observed response which does not seem to fit in with the general pattern of the other responses. Outliers may be identified using

        • 1

          A simple plot of the response against the explanatory variable;

        • 2

          Looking for unusually large residuals;

        • 3

          Calculating studentized residuals.

        The studentized residual for observation i is defined as

si = ϵ^i / (σ^√(1-Hii))

where Hii is the i-th element on the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ. The term σ^√(1-Hii) comes from the sampling distribution of the estimated residuals, the derivation of which is left as a workshop question.

        Remark.

The diagonal terms Hii are referred to as the leverages. This name arises because, as Hii gets closer to one, the fitted value μ^i gets closer to the observed value yi. That is, an observation with a large leverage will have a considerable influence on its fitted value, and consequently on the model fit.

        We can test directly the null hypothesis

        H0:Observation i is not an outlier

        vs.

        H1:Observation i is an outlier

        by calculating the test statistic

ti = si √[(n-p-1)/(n-p-si²)].

        This is compared to the t-distribution with n-p-1 degrees of freedom. We test assuming a two-tailed alternative. If the test is significant, there is evidence that observation i is an outlier.

        An alternative definition of ti is based on fitting the regression model, without using observation i. This model is then used to predict the observation yi, and the difference between the observed and predicted values is calculated. If this difference is small, the observation is unlikely to be an outlier as it can be predicted well using only information from the model and the remaining data.

        The above discussions focus on identifying outliers, but don’t specify what should be done with them. In practice, we should attempt to find out why the observation is an outlier. This reason will indicate whether the observation can safely be ignored (e.g. it occurred due to measurement error) or whether some additional term should be included in the model to explain it.

        TheoremExample 11.4.1 Atmospheric pressure

        Weisberg (2005), p.4 presents data from an experiment by the physicist James D. Forbes (1857) on the relationship between atmospheric pressure and the temperature at which water boils. The 17 observations, and fitted linear regression model, are plotted in Figure 11.5.

Fig. 11.5: Atmospheric pressure against the boiling point of water, with fitted line 𝔼[Pressure] = -81.1 + 0.523 Temperature.

        Are any of the observations outliers?

        A plot of the residuals against temperature in Figure 11.6, suggests that observation 12 might be an outlier, since its residual is much larger than the rest (ϵ^12=0.65).

        Fig. 11.6: Residuals from the fitted model against temperature.

        To calculate the standardized residuals, we first set up the design matrix X and calculate the hat matrix H,

        > load("pressures.Rdata")
        > n <- length(pressure$Temp)
        > X <- matrix(cbind(rep(1,n),pressure$Temp),ncol=2)
        > H <- X%*%solve(t(X)%*%X)%*%t(X)
        > H[12,12]
        [1] 0.06393448

        We also need the residual variance

        > L <- lm(pressure$Pressure~pressure$Temp)
        > summary(L)

        From the summary command we see that the estimated residual standard error σ^ is 0.2328. Similarly

        > L$residuals[12]

        gives the residual ϵ^12=0.65.

        Combining these results, the studentized residual is

s12 = ϵ^12 / (σ^√(1-H12,12))
= 0.65 / (0.2328 × √(1-0.0639))
= 2.89.

        Since n=17 and p=2, the test statistic is

t12 = 2.89 × √[(17-2-1)/(17-2-2.89²)]
= 4.18.

        The p-value to test whether or not observation 12 is an outlier is then

        > 2*(1-pt(4.18,df=14))

which is 9.25×10⁻⁴. Since this is extremely small, we conclude that there is evidence that observation 12 is an outlier.
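R computes these quantities directly: rstandard() returns the studentized residuals si defined above, and rstudent() returns the outlier test statistics ti. A quick check against the hand calculation, using the model L fitted above (the values should agree up to rounding):

> rstandard(L)[12]   # studentized residual s12, approximately 2.89
> rstudent(L)[12]    # outlier test statistic t12, approximately 4.18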

        11.5 Influence

        Outliers can have an unduly large influence on the model fit, but this is not necessarily the case. Conversely, some points which are not outliers may actually have a disproportionate influence on the model fit. One way to measure the influence of an observation on the overall model fit is to refit this model without the observation.

        Cook’s distance summarises the difference between the parameter vector β estimated using the full data set and the parameter vector β(i) obtained using all the data except observation i.

        The formula for calculating Cook’s distance for observation i is

Di = si²Hii / [p(1-Hii)]

        where si is the studentized residual.

It is not straightforward to derive the sampling distribution of Cook’s distance. Instead, it is common practice to follow these guidelines.

        • 1

          First, look for observations with large Di, since if these observations are removed, the estimates of the model parameters will change considerably.

        • 2

          If Di is considerably less than 1 for all observations, none of the cases have an unduly large influence on the parameter estimates.

        • 3

          For every influential observation identified, the model should be refitted without this observation and the changes to the model noted.

        TheoremExample 11.5.1 Atmospheric pressure cont.

        We calculate Cook’s distance for the outlying observation (number 12). From the previous example s12=2.89, H12,12=0.0639 and p=2. Therefore,

D12 = (2.89² × 0.0639) / (2 × (1-0.0639))
= 0.285.

        Since this is reasonably far from 1, we conclude that whilst observation 12 is an outlier, it does not appear to have an unduly large influence on the parameter estimates.
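R's cooks.distance() function implements this formula; a quick check against the hand calculation, again using the fitted model L:

> cooks.distance(L)[12]    # approximately 0.285
> max(cooks.distance(L))   # the largest Cook's distance across all 17 observations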

        11.6 Diagnostics in R

As with everything else we have covered relating to the linear model, model diagnostics can easily be calculated in R. Consider again the pressure data used in the last two examples. Start by fitting the linear model

        > L <- lm(pressure$Pressure~pressure$Temp)

If we apply the base plot function to an lm fit, we can obtain a total of six possible diagnostic plots. We shall look at four of these: residuals against fitted values, a normal QQ plot of the residuals, Cook’s distances, and standardised residuals against leverage.

        > par(mfrow=c(2,2))
        > plot(L,which=c(1:2,4,5))

        The results can be seen in Figure 11.7.

        • 1

          For the first plot, we expect to see no pattern, since residuals and fitted values should be independent. The red line indicates any trend in the plot.

        • 2

          For the QQ plot we hope to see points lying on the line y=x, indicated by the dotted line.

        • 3

          For the Cook’s distance plot we are looking for any particularly large values. The observation number for these will be given (in this example 12, 2 and 17).

        • 4

          For the residuals vs. leverage plot we are looking for any points that have either (or both) an unusually large leverage or an unusually large residual. The red dashed lines show contours of equal Cook’s distance.

        Fig. 11.7: Diagnostics for the pressure temperature regression model.

        An alternative is to download and install the car package, as this has many functions for regression diagnostics. For example

        > library("car")
        > influencePlot(L)

        produces a bubble plot of the Studentised residuals s^i against the hat values Hii. The bubbles are proportional to the size of Cook’s distance. In the pressure data example (see Figure 11.8) we can see that observation 12 is a clear outlier. In addition this function prints values of Cook’s distance, the hat value and the Studentised residual for any ‘unusual’ observations.

Fig. 11.8: Hat values, studentised residuals and Cook’s distance for the pressure temperature regression model. The sizes of the bubbles are proportional to Cook’s distance.

        11.7 Summary

        {mdframed}
        • 1

          Even though you have selected a best model using appropriate covariate selection techniques, it is still necessary to check that the model fits well. Diagnostics provide us with a set of tools to do this.

        • 2

          Diagnostics check that key assumptions made when fitting the linear regression model are in fact satisfied.

        • 3

          QQ and PP plots can be used to check that the estimated residuals are approximately normally distributed.

        • 4

          Plots of estimated residuals vs. fitted values, and estimated residuals vs. explanatory variables, should also be made, to check that these are independent.

        • 5

          The hat matrix

          H=X(XX)-1X

          can be used to prove independence of estimated residuals and fitted values, and of estimated residuals and explanatory variables.

        • 6

          In addition the data should be checked for outliers and points of strong influence.

          • 1

            Outlier: a data point which is unusual compared to the rest of the sample. It usually has a very large studentised residual.

          • 2

            Influential observation: makes a larger than expected contribution to the estimate of β^. It will have a large value of Cook’s distance.

        Chapter 12 Introduction to Likelihood Inference

        12.1 Motivation

        TheoremExample 12.1.1 London homicides

        A starting point is typically a subject-matter question.

        Four homicides were observed in London on 10 July 2008.

        Is this figure a cause for concern? In other words, is it abnormally high, suggesting a particular problem on that day?

        We next need data to answer our question. This may be existing data, or we may need to collect it ourselves.

        We look at the number of homicides occurring each day in London, from April 2004 to March 2007 (this just happens to be the range of data available, and makes for a sensible comparison). The data are given in the table below.

        No. of homicides per day 0 1 2 3 4 5
        Observed frequency 713 299 66 16 1 0
        Table 12.1: Number of homicides per day over a 3 year period.

        Before we go any further, the next stage is always to look at the data through exploratory analysis.

        > obsdata<-c(713,299,66,16,1,0)
        > barplot(obsdata,names.arg=0:5,xlab="Number of homicides", ylab="Frequency",
        col="blue")

        Next, we propose a model for the data to begin addressing the question.

        It is proposed to model the data as iid (independent and identically distributed) draws from a Poisson distribution.

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

The probability mass function (pmf) for the Poisson distribution is given by

Pr[X=x] = λ^{x} exp(-λ)/x!,

for x = 0, 1, 2, …, so it still remains to make a sensible choice for the parameter λ. As before, we will make an estimate of λ based on our data; the equation we use to produce the estimate will be a different estimator from that in Part 1.

        12.2 What is likelihood?

In the last example we discussed that a model for observed data typically has unknown parameters that we need to estimate. We proposed to model the London homicide data by a Poisson distribution. How do we estimate the parameter? In Math104 you met the method of moments estimator (MOME); whilst it is often easy to compute, the estimators it yields can often be improved upon; for this reason it has mostly been superseded by the method of maximum likelihood.

        Definition.

Suppose we have data x1, …, xn that arise from a population with pmf (or pdf) f(⋅), with at least one unknown parameter θ. Then the likelihood function of the parameter θ is the probability (or density) of the observed data for given values of θ.

        If the data are iid, the likelihood function is defined as

L(θ|x1, …, xn) = ∏_{i=1}^{n} f(xi|θ).

        The product arises because we assume independence, and joint probabilities (or densities) obey a product law under independence (Math230).

        TheoremExample 12.2.1 London homicides continued

        For the Poisson example, recall that the probability mass function (pmf) for the Poisson distribution is given by

Pr[X=x] = λ^{x} exp(-λ)/x!.

We will always keep things general to begin with, by talking in terms of data x1, …, xn rather than the actual data from the table. From the definition above, the likelihood function of the unknown parameter λ is

L(λ|x1, …, xn) = ∏_{i=1}^{n} λ^{xi} exp(-λ)/xi!.

        Let’s have a look at how the likelihood function behaves for different values of λ. To keep things simple, let’s take just a random sample of 20 days.

        set.seed(1)
        #read in the data
        x<-c(rep(0,713),rep(1,299),rep(2,66),rep(3,16),4)
        #take a sample: 20 observations
        xs<-sample(x,20)
        #likelihood function
        L<-function(lam){
        m<-length(lam)
        sapply(1:m, function(i) prod(dpois(xs,lam[i])))
        }
        #values of lambda to plot (trial and error..)
        lam<-seq(from=0.3,to=1.1,length=100)
        plot(lam,L(lam),type="l")

        What does the likelihood plot tell us?

        Definition.

Maximum likelihood estimator/estimate. For given data x1, …, xn, the maximum likelihood estimate (MLE) of θ is the value of θ that maximises L(θ|x1, …, xn), and is denoted θ^. The maximum likelihood estimator of θ based on random variables X1, …, Xn will be denoted θ^(𝐗).

        So the MLE, θ^, is the value of θ where the likelihood is the largest. This is intuitively sensible.

        It is often computationally more convenient to take logs of the likelihood function.

        Definition.

The log-likelihood function is the natural logarithm of the likelihood function and, for iid data x1, …, xn, is denoted thus:

l(θ|x1, …, xn) = log{L(θ|x1, …, xn)} = log{∏_{i=1}^{n} f(xi|θ)}
= ∑_{i=1}^{n} log{f(xi|θ)}.

One reason that this is a good thing to do is that it turns the product into a sum, which is easier to deal with algebraically.

        IMPORTANT FACT: Since ‘log’ is a monotone increasing function, θ^ maximises the log-likelihood if and only if it maximises the likelihood; if we want to find an MLE we can choose which of these functions to try and maximise.

        Let us return to the Poisson model. The log-likelihood function for the Poisson data is

l(λ|𝐱) = ∑_{i=1}^{n} log{λ^{xi} exp(-λ)/xi!}
= ∑_{i=1}^{n} {log(λ^{xi}) + log(exp(-λ)) - log(xi!)}
= ∑_{i=1}^{n} {xi log(λ) - λ - log(xi!)}
= log(λ) ∑_{i=1}^{n} xi - nλ - ∑_{i=1}^{n} log(xi!).

        We can now use the maximisation procedure of our choice to find the MLE. A sensible approach in this case is differentiation.

(∂/∂λ) l(λ|𝐱) = (1/λ) ∑_{i=1}^{n} xi - n.

Now if we solve (∂/∂λ) l(λ|𝐱) = 0, this gives us potential candidate(s) for the MLE. For the above, there is only one solution:

(1/λ^) ∑_{i=1}^{n} xi - n = 0

        i.e.

λ^ = ∑_{i=1}^{n} xi / n = x¯.

        To confirm that this really is the MLE, we need to verify it is a maximum:

(∂²/∂λ²) l(λ|𝐱) = -λ⁻² ∑_{i=1}^{n} xi < 0.

        It is clear intuitively that this is a sensible estimator; let’s formalise that into two desirable features we may look for in a ‘good’ estimator. We recall these properties (which you have met before):

        Now plugging in the value of x¯ we find that

λ^ = [(0×713)+(1×299)+(2×66)+(3×16)+(4×1)] / [713+299+66+16+1] = 483/1095.

        Plugging in the actual numerical values of the data at the last minute is good practice, as it means we have solved a more general problem, and helps us to remember about sampling variation.
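We can also check the MLE numerically in R by maximising the Poisson log-likelihood directly; a sketch, reusing the data vector x constructed earlier for the likelihood plot:

> loglik <- function(lam) sum(dpois(x, lam, log = TRUE))
> optimize(loglik, interval = c(0.1, 2), maximum = TRUE)$maximum   # approximately 0.441
> mean(x)                                                          # the analytic MLE, 483/1095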

        The next step is to check that the Poisson model gives a reasonable fit to the data.

        Plotting the observed data against the expected data under the assumed model is a useful technique here.

> #estimated value of lambda (the MLE, 483/1095)
> momlam<-483/1095
> #expected data counts
> expdata<-1095*dpois(0:5,momlam)
> #make plot
> barplot(rbind(obsdata,expdata),names.arg=0:5,xlab=
"Number of homicides",ylab="Frequency",
col=c("blue","green"),beside=T)
> #add legend
> legend("topright",c("observed","expected"),
col=c("blue","green"),lty=1)

        The fit looks very good!

        Let us now compare this to another estimator you already know: the method of moments estimator from Math104.

        Definition.

Let X1, …, Xn be an iid sample from pmf or pdf f(⋅|θ), where θ is one-dimensional. Then the method of moments estimator (MOME) θ^ solves

𝔼_θ^[X] = (1/n) ∑_{i=1}^{n} Xi.

        The left hand side is the theoretical moment and the right hand side is the sample moment. What this means is that we should choose the value of θ^ such that the sample mean is equal to the theoretical mean.

        In our example, recall that for a Poisson distributed random variable, 𝔼[X]=λ. Therefore the MOME for λ is

λ^ = (1/n) ∑_{i=1}^{n} Xi.

        Notice that the MLE coincides with the MOME. We should be glad it does in this case, because the estimator is clearly sensible!

        Whenever the MOME and MLE disagree, the MLE is to be preferred. Much of the rest of this part of the course sets out why that is the case.

        Finally, we need to provide an answer to the original question.

        Four homicides were observed in London on 10 July 2008. Is this figure a cause for concern?

The probability, on a given day, of seeing four or more homicides under the assumed model, where X ∼ Poisson(λ^ = 483/1095), is given by

Pr[X ≥ 4] = 1 - Pr[X < 4] = 0.0011

        by Math230 or R.

        But what must be borne in mind is that we can say this about any day of the year.

Assuming days are independent (which we did anyway when fitting the Poisson distribution earlier), if Y is the number of times we see 4 or more homicides in a year, then Y ∼ Binomial(365, 0.0011), and

Pr[Y ≥ 1] = 1 - Pr[Y = 0] = 0.33.

        So there is a chance of approximately 1/3 that we see 4 or more homicides on at least one day in a given year.

        Some R code for the above:

        > #P[X>=4]:
        > ans1<-1-ppois(3,momlam)
        > #P[Y>=1]:
        > ans2<-1-dbinom(0,365,ans1)

        12.3 Likelihood Examples: continuous parameters

        We now explore examples of likelihood inference for some common models.

        TheoremExample 12.3.1 Accident and Emergency

        Accident and emergency departments are hard to manage because patients arrive at random (they are unscheduled). Some patients may need to be seen urgently.

Excess staff (doctors and nurses) must be avoided because this wastes NHS money; however, A&E departments also have to adhere to performance targets (e.g. patients dealt with within four hours). So staff levels need to be balanced: there should be sufficient staff to meet targets, but not so many that money is wasted.

        A first step in achieving this is to study data on patient arrival times. It is proposed that we model the time between patient arrivals as iid realisations from an Exponential distribution.

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

        Suppose I stand outside Lancaster Royal Infirmary A&E and record the following inter-arrival times of patients (in minutes):

        18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90.

        As usual, the first thing we do is look at the data!

        arrive<-c(18.39,2.70,5.42,0.99,5.42,31.97,2.96,5.28,8.51,10.90)
        stripchart(arrive,pch=4,xlab="inter-arrival time (mins)")

        The exponential pdf is given by

        f(x)=λexp(-λx),

for x ≥ 0, λ > 0. Assuming that the data are iid, the definition of the likelihood function for λ gives us, for general data x1, …, xn,

L(λ) = ∏_{i=1}^{n} λ exp(-λxi).

        Note: we usually drop the ‘|x’ from L(λ|x) whenever possible.

        Usually, when we have products and the parameter is continuous, the best way to find the MLE is to find the log-likelihood and differentiate.

        So the log-likelihood is

l(λ) = ∑_{i=1}^{n} log{λ exp(-λxi)}
= ∑_{i=1}^{n} {log(λ) - λxi}
= n log(λ) - λ ∑_{i=1}^{n} xi.

        Now we differentiate:

(∂/∂λ) l(λ) = l′(λ) = n/λ - ∑_{i=1}^{n} xi.

Now solutions to l′(λ) = 0 are potential MLEs.

        1. 1

n/λ^ - ∑_{i=1}^{n} xi = 0,

        2. 2

λ^ = n / ∑_{i=1}^{n} xi = 1/x¯.

        To ensure this is a maximum we check the second derivative is negative:

(∂²/∂λ²) l(λ) = l″(λ) = -n/λ² < 0.

        So the solution we have found is the MLE, and plugging in our data we find (via 1/mean(arrive))

        λ^=0.108.
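As a quick numerical check (a sketch, reusing the arrive vector above), maximising the log-likelihood in R gives the same answer:

> loglik <- function(lam) sum(dexp(arrive, rate = lam, log = TRUE))
> optimize(loglik, interval = c(0.01, 1), maximum = TRUE)$maximum   # approximately 0.108
> 1/mean(arrive)                                                    # the analytic MLE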

        Now that we have our MLE, we should check that the assumed model seems reasonable. Here, we will use a QQ-plot.

        #MLE of lambda (rate)
        lam<-1/mean(arrive)
        #1/(n+1), 2/(n+1),..., n/(n+1).
        quant<-seq(from=1/11,to=10/11,length=10)
        #produce QQ-plot
qqplot(qexp(quant,rate=lam),arrive,
  xlab="Theoretical quantiles",ylab="Actual")
        #add line of equality
        abline(0,1)

        Given the small dataset, this seems ok – there is no obvious evidence of deviation from the exponential model.

        Knowing that the exponential distribution is reasonable, and having an estimate for its rate, is useful to calculate staff scheduling requirements in the A&E.

        Extensions of the idea consider flows of patients through the various services (take Math332 Stochastic Processes and/or the STOR-i MRes for more on this).

        TheoremExample 12.3.2 Is human body temperature really 98.6 degrees Fahrenheit?

In an article by Mackowiak et al. (Mackowiak, P.A., Wasserman, S.S. and Levine, M.M. (1992), A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich), the authors measure the body temperatures of a number of individuals to assess whether true mean body temperature is 98.6 degrees Fahrenheit or not. A dataset of 130 individuals is available in the normtemp dataset. The data are assumed to be normally distributed with standard deviation 0.73.

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

        What do the data look like?

        The plot can be produced using

        > load("normtemp.Rdata")
        > hist(normtemp$temperature)

        The histogram of the data is reasonable, but there might be some skew in the data (right tail).

        The normal pdf is given by

f(x|μ,σ) = (1/√(2πσ²)) exp{-(x-μ)²/(2σ²)},

        where in this case, σ is known.

        The likelihood is then

L(μ|x1, …, xn) = ∏_{i=1}^{n} (1/√(2πσ²)) exp{-(xi-μ)²/(2σ²)}
= (2πσ²)^{-n/2} exp{-∑_{i=1}^{n} (xi-μ)²/(2σ²)}.

        Since the parameter of interest (in this case μ) is continuous, we can differentiate the log-likelihood to find the MLE:

l(μ) = -(n/2) log(2πσ²) - (1/(2σ²)) ∑_{i=1}^{n} (xi-μ)²

        and so

l′(μ) = -(1/(2σ²)) ∑_{i=1}^{n} (-2)(xi-μ).

        For candidate MLEs we set this to zero and solve, i.e.

(1/σ²) (∑_{i=1}^{n} xi - nμ^) = 0
∑_{i=1}^{n} xi - nμ^ = 0

        and so the MLE is μ^=x¯.

        This is also the “obvious” estimate (the sample mean). To check it is indeed an MLE, the second derivative of the log-likelihood is

l″(μ) = -n/σ² < 0,

        which confirms this is the case.

        Using the data, we find μ^=x¯=98.25.

        This might indicate evidence for the body temperature being different from the assumed 98.6 degrees Fahrenheit.
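A rough supplementary check (not part of the original analysis): since σ = 0.73 is assumed known, we can measure how many standard errors the sample mean lies from 98.6.

> n <- 130; sigma <- 0.73; xbar <- 98.25
> (xbar - 98.6)/(sigma/sqrt(n))   # roughly -5.5, i.e. well over 2 standard errors below 98.6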

        We now check the fit:

        > temps <-normtemp$temperature   # for shorthand
        > mean(temps)
        [1] 98.24923
        > stdtemp = (temps - mean(temps))/0.73
        > qqnorm(stdtemp)
        > abline(0,1)   # add "ideal" fit line y = x

        The fit looks good – although (as the histogram previously showed) there is possibly some mild right (positive) skew, indicated by the quantile points above the y=x line.

        Why might the QQ-plot show the “stepped” behaviour of the points?

        TheoremExample 12.3.3

        Every day I cycle to Lancaster University, and have to pass through the traffic lights at the crossroads by Booths (heading south down Scotforth Road). I am either stopped or not stopped by the traffic lights. Over a period of a term, I cycle in 50 times. Suppose that the time I arrive at the traffic lights is independent of the traffic light sequence.

        On 36 of the 50 days, the lights are on green and I can cycle straight through. Let θ be the probability that the lights are on green. Write down the likelihood and log-likelihood of θ, and hence calculate its MLE.

With the usual iid assumption we see that, if R is the number of times the lights are on green, then R ∼ Binomial(50, θ). So we have

Pr[R=36] = (50 choose 36) θ³⁶ (1-θ)¹⁴.

        We therefore have, for general r and n,

L(θ) = (n choose r) θ^{r} (1-θ)^{n-r},

        and

l(θ) = K + r log(θ) + (n-r) log(1-θ), where the constant K = log(n choose r) does not depend on θ.

Solutions to l′(θ) = 0 are potential MLEs:

l′(θ) = r/θ - (n-r)/(1-θ),

and if l′(θ^) = 0 we have

r/θ^ = (n-r)/(1-θ^),
i.e. θ^ = r/n.

        For this to be an MLE it must have negative second derivative.

l″(θ) = -r/θ² - (n-r)/(1-θ)² < 0.

        In particular we have r=36 and n=50 so θ^=36/50 is the MLE.
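A quick numerical confirmation in R (a sketch; the binomial log-likelihood is maximised over θ):

> loglik <- function(theta) dbinom(36, size = 50, prob = theta, log = TRUE)
> optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum   # approximately 0.72 = 36/50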

        Now suppose that over a two week period, on the 14 occasions I get stopped by the traffic lights (they are on red) my waiting times are given by (in seconds)

        4.2,6.9,13.7,2.8,19.3,10.4,1.0,19.4,18.6,0.6,4.5,12.9,0.5,16.0.

        Assume that the traffic lights remain on red for a fixed amount of time tr, regardless of the traffic conditions.

        Given the above data, write down the likelihood of tr, and sketch it. What is the MLE of tr?

        We are going to assume that these waiting times are drawn independently from Uniform[0,tr], where tr is the parameter we wish to estimate.

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

        Constructing the likelihood for this example is slightly different from those we have seen before. The pdf of the Uniform(0,tr) distribution is

f(x) = 1/tr for 0 ≤ x ≤ tr, and f(x) = 0 otherwise.

The unusual thing here is that the data enter only through the boundary conditions on the pdf. Another way to write the above pdf is

f(x) = tr⁻¹ 𝟏[0 ≤ x ≤ tr],

        where 𝟏 is the indicator function.

For data x1, …, xn, the likelihood function is then

L(tr) = ∏_{i=1}^{n} tr⁻¹ 𝟏[0 ≤ xi ≤ tr].

        We can write this as

L(tr) = tr⁻ⁿ 𝟏[{0 ≤ x1 ≤ tr} ∩ ⋯ ∩ {0 ≤ xn ≤ tr}]
= tr⁻ⁿ 𝟏[max(xi) ≤ tr].

For our case we have n=14 and max(xi)=19.4, so L(tr) = tr⁻¹⁴ 𝟏[19.4 ≤ tr].

We are next asked to sketch this likelihood. In R,

> maxx<-19.4
> #values of t_r to plot
> t<-seq(from=18,to=22,length=1000)
> #likelihood function
> uniflik<-function(t){
+   t^(-14)*(t>=19.4)
+ }
> plot(t,uniflik(t),type="l")

From the plot it is clear that t^r = 19.4 = max(xi), since this is the value that leads to the maximum likelihood. Notice that solving l′(tr) = 0 would not work in this case, since the likelihood is not differentiable at the MLE.

However, on the feasible range of tr, i.e. max(xi) ≤ tr, we have

        l(tr)=-nlog(tr),

        and so

l′(tr) = -n/tr.

Remember that derivatives express the rate of change of a function. Since this derivative is negative (as tr > 0), the log-likelihood, and hence the likelihood, is decreasing over the feasible range of parameter values.

Since we are trying to maximise the likelihood, this means we should take the smallest value in the feasible range as the MLE. The smallest value in the range max(xi) ≤ tr is t^r = max(xi) = 19.4.

        12.4 Likelihood Examples: discrete parameters

        One case where differentiation is clearly not the right approach to use for maximisation is when the parameter of interest is discrete.

        TheoremExample 12.4.1 Illegal downloads

A computer network comprises m computers. The probability that any one of these computers stores illegally downloaded files is 0.3, independently for each computer. In a particular network it is found that exactly one computer contains illegally downloaded files. Our parameter of interest is m.

        What is a suitable model for the data?

        What assumptions are being made?

        Are these assumptions reasonable?

        What is the likelihood of m?

        Let XBin(m,0.3) be the number of computers in the network that contains illegally downloaded files. Then Pr(obs|m) is

L(m) = \Pr(X=1|m) = \binom{m}{1} 0.3^{1} \times 0.7^{m-1} = \frac{0.3}{0.7}\, 0.7^{m}\, m.

Note that the possible values m can take are m = 1, 2, …. We can sketch the likelihood for a suitable range of values:

        > mrange<-0:20 # value for m=0 will be zero
        > plot(mrange,dbinom(1,mrange,0.3),xlab="m",ylab="L(m)")

        From the plot, we can see that the MLE for m is m^=3. Alternatively, from the likelihood we have

\frac{L(m+1)}{L(m)} = \frac{0.3^{1} \times 0.7^{m}\,(m+1)}{0.3^{1} \times 0.7^{m-1}\,m} = \frac{0.7(m+1)}{m}.

        The likelihood is increasing for L(m+1)>L(m), which is equivalent to m<7/3.

The likelihood therefore increases while m < 7/3 and decreases afterwards; the largest integer satisfying this constraint is m=2, so the likelihood increases up to L(3) and decreases from then on, hence m^=3.

        Relative Likelihood intervals

        The ratio between two likelihood values is useful to look at for other reasons.

        Definition.

        Suppose we have data x1,,xn, that arise from a population with likelihood function L(θ), with MLE θ^. Then the relative likelihood of the parameter θ is

R(\theta) = \frac{L(\theta|\mathbf{x})}{L(\hat\theta|\mathbf{x})}.

        The relative likelihood quantifies how likely different values of θ are relative to the maximum likelihood estimate.

        Using this definition, we can construct relative likelihood intervals which are similar to confidence intervals.

        Definition.

        A p% relative likelihood interval for θ is defined as the set

\{\theta \;|\; R(\theta) \ge p/100\}.
        TheoremExample 12.4.2 Illegal downloads (cont.)

        For example a 50% relative likelihood interval for m in our example would be

\{m \;|\; R(m) \ge 0.5\} = \left\{m \;\Big|\; \frac{0.3^{1} \times 0.7^{m-1}\,m}{0.3^{1} \times 0.7^{2} \times 3} \ge 0.5\right\}
= \{m \;|\; 0.7^{m-3}\,m \ge 1.5\}

By plugging in different values of m, we see that the relative likelihood interval is {1, …, 7}. The values in the interval can be checked directly in R, as in the sketch below.
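A minimal R check of this interval (our own code; it reuses dbinom as the likelihood, exactly as in the plot above):

#relative likelihood for the illegal downloads example (MLE is m=3)
rellik<-function(m){
dbinom(1,m,0.3)/dbinom(1,3,0.3)
}
mrange<-1:20
#values of m with relative likelihood at least 0.5
mrange[rellik(mrange)>=0.5]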

        TheoremExample 12.4.3 Sequential sampling with replacement: Smarties colours

        Suppose we are interested in estimating m, the number of distinct colours of Smarties.

        In order to estimate m, suppose members of the class make a number of draws and record the colour.

        Suppose that the data collected (seven draws) were:

        purple, blue, brown, blue, brown, purple, brown.

        We record whether we had a new colour or repeat:

        New, New, New, Repeat, Repeat, Repeat, Repeat.

        Let m denote the number of unique colours. Then the likelihood function for m given the above data is:

L(m|\mathbf{x}_1) = 1 \times \frac{m-1}{m} \times \frac{m-2}{m} \times \frac{3}{m} \times \frac{3}{m} \times \frac{3}{m} \times \frac{3}{m}.

        If in a second experiment, we observed:

        New, New, New, Repeat, New, Repeat, New,

        then the likelihood would be:

L(m|\mathbf{x}_2) = 1 \times \frac{m-1}{m} \times \frac{m-2}{m} \times \frac{3}{m} \times \frac{m-3}{m} \times \frac{4}{m} \times \frac{m-4}{m}.

        The MLEs in each case are m^=3 and m^=8.

        The plots below show the respective likelihoods.

        R code for plotting these likelihoods:

        > # experiment 1:
        > smartlike<-function(m){
        > L<-1*(m-1)*(m-2)*(3)*(3)*(3)*(3)/m^6
        > }
        > mval<-1:15
        > plot(mval,smartlike(mval))
        > abline(v=3,col=2)
        > which.max(smartlike(mval))
        > # Experiment 2:
        > # e.g. pink, purple, blue, blue, brown, purple, orange
        > smartlike2<-function(m){
        > L<-1*(m-1)*(m-2)*(3)*(m-3)*(4)*(m-4)/m^6
        > }
        > dev.new()
        > plot(mval,smartlike2(mval))
        > abline(v=8,col=2)
        > which.max(smartlike2(mval))
        TheoremExample 12.4.4 Brexit opinions

        Three randomly selected members of a class of 10 students are canvassed for their opinion on Brexit. Two are in favour of staying in Europe. What can one infer about the overall class opinion?

The parameter in this model is the number of pro-Remain students in the class, m, say. It is discrete, and could take values 0, 1, 2, …, 10. The actual true unknown value of m is designated by m_true.

        Now Pr(obs|m) is

        Pr(2 in favour from m and 1 against from 10-m).

        Now since the likelihood function of m is the probability (or density) of the observed data for given values of m, we have

L(m) = \frac{\binom{m}{2}\binom{10-m}{1}}{\binom{10}{3}}
= \frac{m(m-1)(10-m)}{240}

for m = 2, 3, …, 9.

        This function is not continuous (because the parameter m is discrete). It can be maximised but not by differentiation.

        > #likelihood function
        > L<-function(m){
        > choose(m,2)*choose(10-m,1)/choose(10,3)
        > }
        > #values of m to plot
        > m<-2:9
        > plot(m,L(m),pch=4,col="blue")

        The maximum likelihood estimate is m^=7. Note that the points are not joined up in this plot. This is to emphasize the discrete nature of the parameter of interest.

        The probability model is an instance of the hypergeometric distribution.

        12.5 Summary

        {mdframed}

        A procedure for modelling and inference:

        1. 1

          Subject-matter question needs answering.

        2. 2

          Data are, or become, available to address this question.

        3. 3

          Look at the data – exploratory analysis.

        4. 4

          Propose a model.

        5. 5

          Check the model fits.

        6. 6

          Use the model to address the question.

        • 1

          The likelihood function is the probability of the observed data for instances of a parameter. Often we use the log-likelihood function as it is easier to work with. The likelihood is a function of an unknown parameter.

        • 2

          The maximum likelihood estimator (MLE) is the value of the parameter that maximises the likelihood. This is intuitively appealing, and later we will show it is a theoretically justified choice. The MLE should be found using an appropriate maximisation technique.

        • 3

          If the parameter is continuous, we can often (but not always) find the MLE by considering the derivative of the log-likelihood. If the parameter is discrete, we usually evaluate the likelihood at a range of possible values.

          DON’T JOIN UP POINTS WHEN PLOTTING THE LIKELIHOOD FOR A DISCRETE PARAMETER.

          DO NOT DIFFERENTIATE LIKELIHOODS OF DISCRETE PARAMETERS!

        Chapter 13 Information and Sufficiency

        13.1 Introduction

        Last time we looked at some more examples of the method of maximum likelihood. When the parameter of interest, θ, is continuous, the MLE, θ^, can be found by differentiating the log-likelihood and setting it equal to zero. We must then check the second derivative of the log-likelihood is negative (at our candidate θ^) to verify that we have found a maximum.

        Definition.

        Suppose we have a sample 𝐱=x1,,xn, drawn from a density f(𝐱|θ) with unknown parameter θ, with log-likelihood l(θ|𝐱). The score function, S(θ), is the first derivative of the log-likelihood with respect to θ:

S(\theta|\mathbf{x}) = l'(\theta|\mathbf{x}) = \frac{\partial}{\partial\theta}\, l(\theta|\mathbf{x}).

        This is just giving a name to something we have already encountered.

As discussed previously, the MLE solves S(\hat\theta)=0. Here, f(\mathbf{x}|\theta) is being used to denote the joint density of \mathbf{x}=x_1,\dots,x_n. For the iid case, f(\mathbf{x}|\theta) = \prod_{i=1}^{n} f(x_i|\theta). Also, l(\theta|\mathbf{x}) = \log f(\mathbf{x}|\theta). This is all just from the definitions.

        Definition.

        Suppose we have a sample 𝐱=x1,,xn, drawn from a density f(𝐱|θ) with unknown parameter θ, with log-likelihood l(θ|𝐱). The observed information function, IO(θ), is MINUS the second derivative of the log-likelihood with respect to θ:

I_O(\theta|\mathbf{x}) = -l''(\theta|\mathbf{x}) = -\frac{\partial^2}{\partial\theta^2}\, l(\theta|\mathbf{x}).

        Remember that the second derivative of l(θ) is negative at the MLE θ^ (that’s how we check it’s a maximum!). So the definition of observed information takes the negative of this to give something positive.

        The observed information gets its name because it quantifies the amount of information obtained from a sample. An approximate 95% confidence interval for θtrue (the unobservable true value of the parameter θ) is given by

\left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right).

        This confidence interval is asymptotic, which means it is accurate when the sample is large. Some further justification on where this interval comes from will follow later in the course.

        What happens to the confidence interval as IO(θ^) changes?

        TheoremExample 13.1.1 Mercedes Benz drivers

You may recall the following example from last year. The website MBClub UK (associated with Mercedes Benz) carried out a poll on the number of times taken to pass a driving test. The results were as follows.

Number of failed attempts   0     1    2    3 or more
Observed frequency          147   47   20   5
Table 13.1: Number of times taken for drivers to pass the driving test.

        As always, we begin by looking at the data.

        obsdata<-c(147,47,20,5)
barplot(obsdata,names.arg=c(0:2,"3 or more"),
        xlab="Number of failed attempts",
        ylab="Frequency",col="orange")

        Next, we propose a model for the data to begin addressing the question.

        It is proposed to model the data as iid (independent and identically distributed) draws from a geometric distribution.

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

        The probability mass function (pmf) for the geometric distribution, where X is defined as the number of failed attempts, is given by

        Pr[X=x]=θ(1-θ)x,

        where x=0,1,2,.

        Assuming that the people in the ‘3 or more’ column failed exactly three times, the likelihood for general data x1,,xn is

L(\theta) = \prod_{i=1}^{n} \theta(1-\theta)^{x_i},

        and the log-likelihood is

l(\theta) = \sum_{i=1}^{n} \log\{\theta(1-\theta)^{x_i}\}
= \sum_{i=1}^{n} \{\log(\theta) + x_i\log(1-\theta)\}
= n\log(\theta) + \log(1-\theta)\sum_{i=1}^{n} x_i.

        The score function is therefore

S(\theta) = l'(\theta) = \frac{n}{\theta} - \frac{\sum_{i=1}^{n} x_i}{1-\theta}.

        A candidate for the MLE, θ^, solves S(θ^)=0:

1. 1

  \frac{n}{\hat\theta} = \frac{\sum_{i=1}^{n} x_i}{1-\hat\theta},

2. 2

  n(1-\hat\theta) = \hat\theta\sum_{i=1}^{n} x_i,

3. 3

  n = \hat\theta\left(n + \sum_{i=1}^{n} x_i\right),

4. 4

  \hat\theta = \frac{n}{n + \sum_{i=1}^{n} x_i}.

        To confirm this really is an MLE we need to verify it is a maximum, i.e. a negative second derivative.

l''(\theta) = -\frac{n}{\theta^2} - \frac{\sum_{i=1}^{n} x_i}{(1-\theta)^2} < 0.

In this case the function is clearly negative for all \theta \in (0,1); if not, we would just need to check that it is negative at the proposed MLE.

Now plugging in the numbers, n=219 and \sum_{i=1}^{n} x_i = 0\times 147 + 1\times 47 + 2\times 20 + 3\times 5 = 102, we get

\hat\theta = \frac{219}{219+102} = 0.682.

        This is the same answer as the ‘obvious one’ from intuition.

        But now we can calculate the observed information at θ^, and use this to construct a 95% confidence interval for θtrue.

I_O(\hat\theta) = -l''(\hat\theta)
= \frac{n}{\hat\theta^2} + \frac{\sum_{i=1}^{n} x_i}{(1-\hat\theta)^2}
= \frac{219}{0.682^2} + \frac{102}{(1-0.682)^2}
= 1479.5.

        Now the 95% confidence interval is given by

(l,u) = \left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right)
= \left(0.682 - \frac{1.96}{\sqrt{1479.5}},\; 0.682 + \frac{1.96}{\sqrt{1479.5}}\right)
= (0.631, 0.733).
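These numbers are easily reproduced in R (a minimal check of the calculation above, using the totals n=219 and 102 from the table):

#driving test data: MLE, observed information and 95% CI
n<-219
sumx<-102
thetahat<-n/(n+sumx)
Iobs<-n/thetahat^2+sumx/(1-thetahat)^2
thetahat+c(-1,1)*1.96/sqrt(Iobs)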

        We should also check the fit of the model by plotting the observed data against the theoretical data from the model (with the MLE plugged in for θ).

        #value of theta: MLE
        mletheta<-0.682
        #expected data counts
        expdata<-219*c(dgeom(0:2,mletheta),1-pgeom(2,mletheta))
        #make plot
barplot(rbind(obsdata,expdata),names.arg=c(0:2,"3 or more"),
        xlab="Number of failed attempts",ylab="Frequency",
        col=c("orange","red"),beside=T)
        #add legend
        legend("topright",c("observed","expected"),
        col=c("orange","red"),lty=1)

We can actually do slightly better than this.

        We assumed ‘the people in the “3 or more” column failed exactly three times’. With likelihood we don’t need to do this. Remember: the likelihood is just the joint probability of the data. In fact, people in the “3 or more” group have probability

\Pr[X \ge 3] = 1 - (\Pr[X=0] + \Pr[X=1] + \Pr[X=2])
= 1 - (\theta + (1-\theta)\theta + (1-\theta)^2\theta).

        We could therefore write the likelihood more correctly as

L(\theta) = \prod_{i=1}^{n}\{\theta(1-\theta)^{x_i}\}^{z_i}\prod_{i=1}^{n}\{1 - (\theta + (1-\theta)\theta + (1-\theta)^2\theta)\}^{1-z_i},

where z_i = 1 if x_i < 3 and z_i = 0 if x_i ≥ 3.

NOTE: if all we know about an observation x is that it exceeds some value, we say that x is censored. This is an important issue with patient data, as we may lose contact with a patient before we have finished observing them. Censoring is dealt with in more generality in MATH335 Medical Statistics.

        What is the MLE of θ using the more correct version of the likelihood?

The term in the second product (for the censored observations) can be evaluated via a geometric progression with first term a=θ and common ratio r=(1-θ), and so Pr(X ≥ 3|θ) = (1-θ)^3 (check that this is the case).

        Hence the likelihood can be written

L(\theta) = \theta^{n_u}(1-\theta)^{\sum x_i}\left((1-\theta)^3\right)^{n_c}
= \theta^{n_u}(1-\theta)^{\sum x_i + 3n_c}

        where the sum of xi’s only involves the uncensored observations, nu denotes the number of uncensored observations, and nc is the number of censored observations.

The log-likelihood becomes l(\theta) = n_u\log(\theta) + \left(\sum x_i + 3n_c\right)\log(1-\theta).

        Differentiating, the score function is

S(\theta) = l'(\theta) = \frac{n_u}{\theta} - \frac{\sum x_i + 3n_c}{1-\theta}.

A candidate MLE solves l'(\hat\theta)=0, giving

\frac{n_u}{\hat\theta} = \frac{\sum x_i + 3n_c}{1-\hat\theta}
n_u(1-\hat\theta) = \hat\theta\left(\sum x_i + 3n_c\right)
n_u = \hat\theta\left(n_u + \sum x_i + 3n_c\right)
\hat\theta = \frac{n_u}{n_u + \sum x_i + 3n_c}.

The value of the MLE using these data is \frac{214}{214+102} = 0.677.

        Compare this to the original MLE of 0.682.

        Why is the new estimate different to this?

        Why is the difference small?
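As a numerical sanity check (our own sketch, using the counts implied by Table 13.1: n_u = 214 uncensored observations with \sum x_i = 87, and n_c = 5 censored observations):

#maximise the censored geometric log-likelihood numerically
nu<-214
sumxu<-87
nc<-5
cloglik<-function(theta){
nu*log(theta)+(sumxu+3*nc)*log(1-theta)
}
optimize(cloglik,interval=c(0.01,0.99),maximum=TRUE)$maximum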

        13.2 Suppression of Information

        Last time we introduced the score function (the derivative of the log-likelihood), and the observed information function (MINUS the second derivative of the log-likelihood). The score function is zero at the MLE. The observed information function evaluated at the MLE gives us a method to construct confidence intervals.

        We will now study the concept of observed information in more detail.

        TheoremExample 13.2.1 Human Genotyping

        Humans are a diploid species, which means you have two copies of every gene (one from your father, one from your mother). Genes occur in different forms; this is what leads to some of the different traits you see in humans (e.g. eye colour). Mendelian traits are a special kind of trait that are determined by a single gene.

        Having wet or dry earwax is a Mendelian trait. Earwax wetness is controlled by the gene ABCC11 (this gene lives about half way along chromosome 16). We will call the wet earwax version of ABCC11 W, and the dry version w. The wet version is dominant, which means you only need one copy of W to have wet earwax. Both copies of the gene need to be w to get dry earwax.

The Hardy-Weinberg law of genetics states that if W occurs in a (randomly mating) population with proportion p (so w occurs with proportion 1-p), then the possible combinations in humans obey the proportions:

combination   WW     Ww         ww
proportion    p^2    2p(1-p)    (1-p)^2

        Suppose I take a sample of 100 people and assess the wetness of their earwax. I observe that 87 of the people have wet earwax and 13 of them have dry earwax.

        I am actually interested in p, the proportion of copies of W in my population.

        Show that the probability of a person having wet earwax is p(2-p), and that the probability of a person having dry earwax is (1-p)2. Also show that these two probabilities sum to 1.

        The number of people with wet earwax in my sample is therefore Binomial(100,p(2-p)). So

\Pr[\text{obs}|p] = \binom{100}{87}\{p(2-p)\}^{87}\{(1-p)^2\}^{13}.

IMPORTANT FACT: when writing down the likelihood, we can always omit multiplicative constants, since they become additive in the log-likelihood and then disappear in the differentiation. A multiplicative constant is one that does not depend on the parameter of interest (here p).

        So we can write down the likelihood as

L(p) \propto \{p(2-p)\}^{87}\{(1-p)^2\}^{13}
= \{p(2-p)\}^{87}(1-p)^{26}.

        So the log likelihood is

        l(p) =87log{p(2-p)}+26log(1-p)
        =87log(p)+87log(2-p)+26log(1-p)

        (plus constant).

        Now p is a continuous parameter so a suitable way to find a candidate MLE is to differentiate. The score function is

S(p) = l'(p) = \frac{87}{p} - \frac{87}{2-p} - \frac{26}{1-p}.

We can solve S(p^)=0 by rearranging it as a quadratic in p^:

1. 1

  87(2-p)(1-p) - 87p(1-p) - 26p(2-p) = 0,

2. 2

  200\hat p^2 - 400\hat p + 174 = 0,

3. 3

  \hat p = \frac{400 \pm \sqrt{400^2 - 4\times 200\times 174}}{2\times 200}.

This gives two solutions, but we need \hat p \in [0,1] as it is a proportion, so we get \hat p = 0.639 as our potential MLE.

        The second derivative is

l''(p) = -\frac{87}{p^2} - \frac{87}{(2-p)^2} - \frac{26}{(1-p)^2}.

        This is clearly <0 at p^, confirming that it is a maximum.

        The observed information is obtained by substituting p^ into -l′′(p), giving

I_O(\hat p) = \frac{87}{0.639^2} + \frac{87}{(2-0.639)^2} + \frac{26}{(1-0.639)^2} = 459.5.

        Hence an approximate 95% confidence interval for ptrue is given by

(l,u) = \left(\hat p - \frac{1.96}{\sqrt{I_O(\hat p)}},\; \hat p + \frac{1.96}{\sqrt{I_O(\hat p)}}\right)
= \left(0.639 - \frac{1.96}{\sqrt{459.5}},\; 0.639 + \frac{1.96}{\sqrt{459.5}}\right)
= (0.548, 0.730).

        After all that derivation, don’t forget the context. This is a 95% confidence interval for the proportion of people with a W variant of ABCC11 gene in the population of interest.

        Suppose that, instead of looking in people’s ears to see whether their wax is wet or dry we decide to genotype them instead, thereby knowing whether they are WW, Ww or ww.

        This is a considerably more expensive option (although perhaps a little less disgusting) so a natural question is: what do we gain by doing this?

        We take the same 100 people and find that 42 are WW, 45 are Ww and 13 are ww. Think about how this relates back to the earwax wetness. Did we need to genotype everyone?

        The likelihood function for p given our new information is

L(p) \propto (p^2)^{42}\{2p(1-p)\}^{45}\{(1-p)^2\}^{13}
= p^{84}\{2p(1-p)\}^{45}(1-p)^{26}.

        The log-likelihood is

        l(p) =84log(p)+45log{2p(1-p)}+26log(1-p)
        =84log(p)+45log(2)+45log(p)+45log(1-p)+26log(1-p)
        =129log(p)+71log(1-p)+c.

        where c is a constant.

        As before, p is continuous so we can find candidates for the MLE by differentiating:

S(p) = l'(p) = \frac{129}{p} - \frac{71}{1-p}.

        Now solving S(p^)=0 gives a candidate MLE

1. 1

  \frac{129}{\hat p} = \frac{71}{1-\hat p},

2. 2

  129(1-\hat p) = 71\hat p,

i.e.

\hat p = \frac{129}{200} = 0.645.

        This is our potential MLE. Checking the second derivative

l''(p) = -\frac{129}{p^2} - \frac{71}{(1-p)^2},

        which is <0 at p^ confirming that it is a maximum.

        The observed information is obtained by substituting p^ into -l′′(p), giving

I_O(\hat p) = \frac{129}{0.645^2} + \frac{71}{(1-0.645)^2} = 873.5.

        Hence an approximate 95% confidence interval for ptrue is given by

(l,u) = \left(\hat p - \frac{1.96}{\sqrt{I_O(\hat p)}},\; \hat p + \frac{1.96}{\sqrt{I_O(\hat p)}}\right)
= \left(0.645 - \frac{1.96}{\sqrt{873.5}},\; 0.645 + \frac{1.96}{\sqrt{873.5}}\right)
= (0.579, 0.711).
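Both analyses can be reproduced numerically; the following R sketch (our own code, using the counts above) computes the two MLEs, observed informations and confidence intervals side by side:

#phenotype data: 87 wet, 13 dry
loglik1<-function(p){87*log(p*(2-p))+26*log(1-p)}
#genotype data: 42 WW, 45 Ww, 13 ww
loglik2<-function(p){129*log(p)+71*log(1-p)}
p1<-optimize(loglik1,c(0.01,0.99),maximum=TRUE)$maximum
p2<-optimize(loglik2,c(0.01,0.99),maximum=TRUE)$maximum
I1<-87/p1^2+87/(2-p1)^2+26/(1-p1)^2
I2<-129/p2^2+71/(1-p2)^2
#95% confidence intervals for the two designs
rbind(phenotype=p1+c(-1,1)*1.96/sqrt(I1),
      genotype=p2+c(-1,1)*1.96/sqrt(I2))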

        Now, compare the confidence intervals and the observed informations from the two separate calculations. What do you conclude?

        Of course, genotyping the participants of the study is expensive, so may not be worthwhile. If this was a real problem, the statistician could communicate the figures above to the geneticist investigating gene ABCC11, who would then be able to make an evidence-based decision about how to conduct the experiment.

        13.3 Sufficiency

Recall the driving test data from Example 13.1.1.

Number of failed attempts   0     1    2    3 or more
Observed frequency          147   47   20   5
Table 13.2: Number of times taken for drivers to pass the driving test.

        We chose to model these data as being geometrically distributed. Assuming that the people in the ‘3 or more’ column failed exactly three times, the log-likelihood for general data x1,,xn is

l(\theta) = \sum_{i=1}^{n} \log\{\theta(1-\theta)^{x_i}\}
= \sum_{i=1}^{n} \{\log(\theta) + x_i\log(1-\theta)\}
= n\log(\theta) + \log(1-\theta)\sum_{i=1}^{n} x_i.

Now, suppose that, rather than being presented with the table of passing attempts, you were simply told that, with 219 people filling in the survey, \sum_{i=1}^{219} x_i = 102.

        Would it still be possible to proceed with fitting the model?

The answer is yes; moreover, we can proceed in exactly the same way, and achieve the same results! This is because, if you look at the log-likelihood, the only way in which the data are involved is through \sum_{i=1}^{n} x_i, meaning that in some sense, this is all we need to know.

This is clearly a big advantage: we just have to remember one number rather than an entire table.

We call \sum_{i=1}^{n} x_i a sufficient statistic for \theta.

        Definition.

        Let 𝐱=x1,,xn be a sample from f(|θ). Then a function of the data T(𝐱) is said to be a sufficient statistic for θ (or sufficient for θ) if 𝐱 is independent of θ given T(𝐱), i.e.

        Pr[𝐗=𝐱|T(𝐱),θ]=Pr[𝐗=𝐱|T(𝐱)].

        Some consequences of this definition:

        • 1

          For the objective of learning about θ, if I am told T(𝐱), there is no value in being told anything else about 𝐱.

        • 2

If I have two datasets \mathbf{x}_1 and \mathbf{x}_2, and T(\mathbf{x}_1)=T(\mathbf{x}_2), then I should make the same conclusions about \theta from both, even if \mathbf{x}_1 \ne \mathbf{x}_2.

        • 3

          Sufficient statistics always exist since trivially T(𝐱)=𝐱 always satisfies the above definition.

        Definition.

        Let 𝐱=x1,,xn be a sample from f(|θ). Let T(𝐱) be sufficient for θ. Then T(𝐱) is said to be minimally sufficient for θ if there is no sufficient statistic with a lower dimension than T.

        Theorem (Neyman factorisation theorem).

        Let x=x1,,xn be a sample from f(|θ). Then a function T(x) is sufficient for θ if and only if the likelihood function can be factorised in the form

        L(θ)=g(𝐱)×h(T(𝐱),θ),

where g is a function of the data only, and h is a function of \theta and of the data only through T(\mathbf{x}).

        For a proof see page 276 of Casella and Berger.

        We can also express the factorisation result in terms of the log-likelihood, which is often easier, just by taking logs of the above result:

        l(θ) =log{g(𝐱)×h(T(𝐱),θ)}
        =log{g(𝐱)}+log{h(T(𝐱),θ)}
        =g~(𝐱)+h~(T(𝐱),θ),

        where g~=log(g) and h~=log(h).

        We can show that i=1nxi is sufficient for θ in the driving test example by inspection of the log-likelihood:

l(\theta) = n\log(\theta) + \log(1-\theta)\sum_{i=1}^{n} x_i.

Letting T(\mathbf{x}) = \sum_{i=1}^{n} x_i, \tilde{h}(T(\mathbf{x}),\theta) = n\log(\theta) + \log(1-\theta)T(\mathbf{x}), and \tilde{g}(\mathbf{x})=0, we have satisfied the factorisation criterion, and hence T(\mathbf{x}) = \sum_{i=1}^{n} x_i is sufficient for \theta.

Suppose that I carry out another survey on attempts to pass a driving test, again with n=219 participants, and get data \mathbf{y}=y_1,\dots,y_n, with \mathbf{x} \ne \mathbf{y} but \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i. Are the following statements true or false?

        1. 1

          θ^(𝐱), the MLE based on data 𝐱, is the same as θ^(𝐲), the MLE based on data 𝐲.

        2. 2

          The confidence intervals based on both datasets will be identical.

        3. 3

          The geometric distribution is appropriate for both datasets.

        An important shortcoming in only considering the sufficient statistic is that it does not allow us to check how well the chosen model fits.

        TheoremExample 13.3.1 Poisson parameter (cont.)

        Recall from the beginning of this section, the London homicides data, which we modelled as a random sample from the Poisson distribution. We found

L(\lambda|x_1,\dots,x_n) = \prod_{i=1}^{n} \frac{\lambda^{x_i}\exp(-\lambda)}{x_i!}
= \lambda^{\sum_i x_i}\exp(-n\lambda)\prod_{i=1}^{n} \frac{1}{x_i!}
\propto \lambda^{\sum_i x_i}\exp(-n\lambda),

        and that the log-likelihood function for the Poisson data is consequently

l(\lambda) = \log(\lambda)\sum_{i=1}^{n} x_i - n\lambda + c,

        with the MLE being

\hat\lambda = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}.

        By differentiating again, we can find the information function

l''(\lambda|\mathbf{x}) = -\lambda^{-2}\sum_{i=1}^{n} x_i,

        and so

I_O(\lambda|\mathbf{x}) = \lambda^{-2}\sum_{i=1}^{n} x_i.

        What is a sufficient statistic for the Poisson parameter?

For this case, letting T(\mathbf{x}) = \sum_{i=1}^{n} x_i, \tilde{h}(T(\mathbf{x}),\lambda) = \log(\lambda)T(\mathbf{x}) - n\lambda, and \tilde{g}(\mathbf{x}) = c = -\sum_{i=1}^{n}\log(x_i!), we have satisfied the factorisation criterion, and hence T(\mathbf{x}) = \sum_{i=1}^{n} x_i is sufficient for \lambda.

        TheoremExample 13.3.2 Normal variance

Suppose the sample x_1,\dots,x_n comes from X ∼ N(0,\theta). Find a sufficient statistic for \theta. Is the MLE a function of this statistic or of the sample mean? Give a formula for the 95% confidence interval of \theta.

        First, the Normal(0,θ) density is given by

f(x_i|\theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\left\{-\frac{x_i^2}{2\theta}\right\},

        leading to the likelihood

1. 1

  L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta}}\exp\left\{-\frac{x_i^2}{2\theta}\right\},

2. 2

  L(\theta) \propto \frac{1}{\theta^{n/2}}\exp\left\{-\frac{\sum_i x_i^2}{2\theta}\right\}.

Hence, T(\mathbf{x}) = \sum_i x_i^2 is a sufficient statistic for \theta. The log-likelihood and score functions are

1. 1

  l(\theta) = -\frac{n}{2}\log\theta - \frac{\sum_i x_i^2}{2\theta},

2. 2

  S(\theta) = l'(\theta) = -\frac{n}{2\theta} + \frac{\sum_i x_i^2}{2\theta^2}.

        Solving S(θ)=0 gives a candidate MLE

\hat\theta = \frac{\sum_i x_i^2}{n},

        which is a function of the sufficient statistic. To check this is an MLE we calculate

l''(\theta) = \frac{n}{2\theta^2} - \frac{\sum_i x_i^2}{\theta^3}.

        In this case it isn’t immediately obvious that l′′(θ^)<0, but substituting in

l''(\hat\theta) = \frac{n}{2(\sum x_i^2/n)^2} - \frac{\sum x_i^2}{(\sum x_i^2/n)^3}
= \frac{n^3}{2(\sum x_i^2)^2} - \frac{n^3}{(\sum x_i^2)^2}
= -\frac{n^3}{2(\sum x_i^2)^2} < 0,

        confirming that this is an MLE.

        The observed information is IO(θ^)=-l′′(θ^),

I_O(\hat\theta) = \frac{n^3}{2(\sum_i x_i^2)^2}.

        Therefore a 95% confidence interval is given by

(l,u) = \left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right)
= \left(\frac{\sum_i x_i^2}{n} - 1.96\sqrt{2}\,n^{-3/2}\sum_i x_i^2,\; \frac{\sum_i x_i^2}{n} + 1.96\sqrt{2}\,n^{-3/2}\sum_i x_i^2\right).
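A quick simulated check of these formulae (our own sketch; the true variance theta=4 and sample size n=100 are arbitrary choices):

#simulated check of the N(0,theta) example
set.seed(1)
n<-100
theta<-4
x<-rnorm(n,mean=0,sd=sqrt(theta))
#MLE, observed information and 95% CI from the formulae above
thetahat<-sum(x^2)/n
Iobs<-n^3/(2*sum(x^2)^2)
thetahat+c(-1,1)*1.96/sqrt(Iobs)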

        13.4 Summary

        {mdframed}
        • 1

          The score function is the first derivative of the log-likelihood. The observed information is MINUS the second derivative of the log-likelihood. It will always be positive when evaluated at the MLE.

          DO NOT FORGET THE MINUS SIGN!

        • 2

          The likelihood function adjusts appropriately when more information becomes available. Observed information does what it says. Higher observed information leads to narrower confidence intervals. This is a good thing as narrower confidence intervals mean we are more sure about where the true value lies.

          For a continuous parameter of interest, θ, the calculation of the MLE and its confidence interval follows the steps:

          1. 1

            Write down the likelihood, L(θ).

          2. 2

            Write down the log-likelihood, l(θ).

          3. 3

            Work out the score function, S(θ)=l(θ).

          4. 4

            Solve S(θ^)=0 to get a candidate for the MLE, θ^.

          5. 5

            Work out l′′(θ). Check it is negative at the MLE candidate to verify it is a maximum.

          6. 6

Work out the observed information, I_O(\hat\theta) = -l''(\hat\theta).

          7. 7

            Calculate the confidence interval for θtrue:

\left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right).
        • 3

          Changing the data that your inference is based on will change the amount of information, and subsequent inference (e.g. confidence intervals).

        • 4

A statistic T(\mathbf{x}) is said to be sufficient for a parameter \theta if \mathbf{x} is independent of \theta when conditioning on T(\mathbf{x}).

        • 5

An equivalent, and easier to demonstrate, condition is the factorisation criterion: T(\mathbf{x}) is sufficient if and only if the likelihood can be factorised in the form L(\theta) = g(\mathbf{x}) \times h(T(\mathbf{x}),\theta).

        Chapter 14 Distribution of the MLE

        14.1 Recalling randomness

        We have noted that an asymptotic 95% confidence interval for a true parameter, θ, is given by

\left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right),

        where θ^ is the MLE and

I_O(\theta|\mathbf{x}) = -l''(\theta|\mathbf{x}) = -\frac{\partial^2}{\partial\theta^2}\, l(\theta|\mathbf{x}),

        is the observed information.

        In this lecture we will sketch the derivation of the distribution of the MLE, and show why the above really is an asymptotic 95% confidence interval for θ.

        Recall the distinction between an estimate and an estimator.

        Given a sample X1,,Xn, an estimator is any function W(X1,,Xn) of that sample. An estimate is a particular numerical value produced by the estimator for given data x1,,xn.

        The maximum likelihood estimator is a random variable; therefore it has a distribution. A maximum likelihood estimate is just a number, based on fixed data.

        For the rest of this lecture we consider an iid sample X1,,Xn, from some distribution with unknown parameter θ, and the MLE (maximum likelihood estimator) θ^(𝐗).

        Definition.

        The Fisher information of a random sample X1,,Xn is the expected value of minus the second derivative of the log-likelihood, evaluated at the true value of the parameter:

I_E(\theta) = \mathbb{E}\left[-\frac{\partial^2}{\partial\theta^2}\, l(\theta|\mathbf{X})\right].

        This is related to, but different from, the observed information.

        • 1

          The observed information is calculated based on observed data; the Fisher information is calculated taking expectations over random data.

        • 2

The observed information is calculated at \hat\theta; the Fisher information is calculated at \theta_{true}.

        • 3

          The observed information can be written down numerically; the Fisher information usually cannot be since it depends on θtrue, which is unknown.

        TheoremExample 14.1.1 Fisher Information for a Poisson parameter

Suppose \mathbf{x} is a random sample from X ∼ Poisson(\theta_{true}). Find the Fisher information. Remember that \mathbb{E}[X]=\theta_{true}. For \theta>0,

L(\theta) = f(\mathbf{x}|\theta) = \prod_{i=1}^{n} \frac{e^{-\theta}\theta^{x_i}}{x_i!}
= e^{-n\theta}\theta^{\sum_i x_i} \times c

        where c is a constant.

1. 1

  \log f(\mathbf{x}|\theta) = -n\theta + \sum_i x_i\log\theta + c,

2. 2

  \frac{\partial}{\partial\theta}\log f(\mathbf{x}|\theta) = \frac{\sum_i x_i}{\theta} - n,

3. 3

  \frac{\partial^2}{\partial\theta^2}\log f(\mathbf{x}|\theta) = -\frac{\sum_i x_i}{\theta^2},

4. 4

  \frac{\partial^2}{\partial\theta^2}\log f(\mathbf{X}|\theta) = -\frac{\sum_i X_i}{\theta^2}.

        Hence

I_E(\theta_{true}) = \mathbb{E}\left(\frac{\sum_i X_i}{\theta_{true}^2}\right)
= \frac{n\theta_{true}}{\theta_{true}^2} = \frac{n}{\theta_{true}}.

We see that our answer is in terms of \theta_{true}, which is unknown (and not in terms of the data!). The Fisher information is useful for many things in likelihood inference; to see more, take MATH330 Likelihood Inference.

        Here, it features in the most important theorem in the course.

        Theorem (Asymptotic distribution of the maximum likelihood estimator).

Suppose we have an iid sample \mathbf{X}=X_1,\dots,X_n from some distribution with unknown parameter \theta, with maximum likelihood estimator \hat\theta(\mathbf{X}). Then (under certain regularity conditions) in the limit as n \to \infty,

\hat\theta(\mathbf{X}) \sim N\left(\theta,\, I_E^{-1}(\theta)\right).

        This says that, for n large, the distribution of the MLE is approximately normal with mean equal to the true value of the parameter, and variance equal to the reciprocal of the Fisher information.

        We will not prove the result in this course, but it has to do with the central limit theorem (from MATH230).

        Turning this around, this means that, for large n,

\Pr\left[\theta \in \left(\hat\theta(\mathbf{X}) - 1.96\sqrt{I_E^{-1}(\theta)},\; \hat\theta(\mathbf{X}) + 1.96\sqrt{I_E^{-1}(\theta)}\right)\right] \approx 0.95.

        This result is useless as it stands, because we can only calculate IE(θ) when we know θ, and if we know it, why are we constructing a confidence interval for it?!

        Luckily, the result also works asymptotically if we replace IE(θ) by IO(θ^), giving that

\left(\hat\theta(\mathbf{x}) - \frac{1.96}{\sqrt{I_O(\hat\theta(\mathbf{x}))}},\; \hat\theta(\mathbf{x}) + \frac{1.96}{\sqrt{I_O(\hat\theta(\mathbf{x}))}}\right)

        is an approximate 95% confidence interval for θ (as claimed earlier).
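A small simulation illustrates the theorem (our own sketch; the Poisson case from Example 14.1.1, with arbitrary choices n=100 and lambda=2, for which I_E^{-1}(lambda) = lambda/n):

#sampling distribution of the Poisson MLE (lambda-hat = xbar)
set.seed(1)
n<-100
lambda<-2
mles<-replicate(5000,mean(rpois(n,lambda)))
hist(mles,breaks=40,freq=FALSE,xlab="MLE",main="")
#overlay the N(lambda, lambda/n) density predicted by the theorem
curve(dnorm(x,mean=lambda,sd=sqrt(lambda/n)),add=TRUE,col=2)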

        Exam Question

        A large batch of electrical components contains a proportion θ which are defective and not repairable, a proportion 3θ which are defective but repairable and a proportion 1-4θ which are satisfactory.

        • (a)

          What values of θ are admissible?

        Fifty components are selected at random (with replacement) from the batch, of which 2 are defective and not repairable, 5 are defective and repairable and 43 are satisfactory.

        • (b)

          Write down the likelihood function, L(θ) and make a rough sketch of it.

        • (c)

          Obtain the maximum likelihood estimate of θ.

        • (d)

          Obtain an approximate 95% confidence interval for θ. A value of θ equal to 0.02 is believed to represent acceptable quality for the batch. Do the data support the conclusion that the batch is of acceptable quality?

        Solution:

        1. a

          There are 3 types of component, each giving rise to a constraint on θ:

1. 1

  0 \le \theta \le 1,

2. 2

  0 \le 3\theta \le 1,

3. 3

  0 \le 1-4\theta \le 1,

as the components each need to have valid probabilities. The third inequality is sufficient for the other two and gives 0 \le \theta \le 1/4.

        2. b

          Given the data, the likelihood is

L(\theta) \propto \theta^2(3\theta)^5(1-4\theta)^{43}
\propto \theta^7(1-4\theta)^{43}.

          For the sketch note that L(0)=L(1/4)=0 and the function is concave and positive between these two with a maximum closer to 0 than 1/4.

        3. c

          To work out the MLE, we differentiate the (log-)likelihood as usual. The log-likelihood is

          l(θ)=7logθ+43log(1-4θ).

          Differentiating,

l'(\theta) = \frac{7}{\theta} - \frac{4\times 43}{1-4\theta}.

A candidate MLE solves l'(\hat\theta)=0, giving \hat\theta = \frac{7}{200}.

          Moreover,

l''(\theta) = -\frac{7}{\theta^2} - \frac{4\times 4\times 43}{(1-4\theta)^2} < 0,

          so this is indeed the MLE.

        4. d

          The observed information is

I_O(\hat\theta) = -l''(\hat\theta)
= \frac{7}{\hat\theta^2} + \frac{4\times 4\times 43}{(1-4\hat\theta)^2}
= 5714.3 + 930.2
= 6644.5.

          So a 95% confidence interval for θ is

\left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right)
= (0.0110, 0.0590).

          As 0.02 is within this confidence interval there is no evidence of this batch being sub-standard.
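A short R sketch (our own code) reproduces the sketch of the likelihood in (b) and the numbers in (c) and (d):

#likelihood sketch and numerical check for the components question
lik<-function(theta){theta^7*(1-4*theta)^43}
theta<-seq(from=0.001,to=0.249,length=1000)
plot(theta,lik(theta),type="l")
#MLE, observed information and 95% CI
thetahat<-7/200
Iobs<-7/thetahat^2+16*43/(1-4*thetahat)^2
thetahat+c(-1,1)*1.96/sqrt(Iobs)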

        14.2 Summary

        {mdframed}
        • 1

          Under certain regularity conditions, the maximum likelihood estimator has, asymptotically, a normal distribution with mean equal to the true parameter value, and variance equal to the inverse of the Fisher information.

        • 2

          The Fisher information is minus the expectation of the second derivative of the log-likelihood evaluated at the true parameter value.

        • 3

          Based on this, we can construct approximate 95% confidence intervals for the true parameter value based on the MLE and the observed information.

        • 4

          Importantly, this is an asymptotic result so is only approximate. In particular, it is a bad approximation to a 95% confidence interval when the sample size, n, is small.

        Chapter 15 Deviance and the LRT

        15.1 Deviance-based confidence intervals

        In the last lecture we showed that the MLE is asymptotically normally distributed, and we use this fact to construct an approximate 95% confidence interval.

        In this lecture we will introduce the concept of deviance, and show that this leads to another way to calculate approximate confidence intervals that have various advantages.

        We will begin by showing through an example where things can go wrong with the confidence intervals we know (and love?).

        TheoremExample 15.1.1 An evening at the casino

        On a fair (European) roulette wheel there is a 1/37 probability of each number coming up.

        In the early 1990s, Gonzalo Garcia-Pelayo believed that casino roulette wheels were not perfectly random, and that by recording the results and analysing them with a computer, he could gain an edge on the house by predicting that certain numbers were more likely to occur next than the odds offered by the house suggested. This he did at the Casino de Madrid in Madrid, Spain, winning 600,000 euros in a single day, and one million euros in total.

Legal action against him by the casino was unsuccessful, it being ruled that the casino should fix its wheel.

        Suppose I am curious that the number 17 seems to come up on a casino’s roulette wheel more frequently than other numbers. I track it for 30 spins, during which it comes up 2 times. I decide to carry out a likelihood analysis on p, the probability of the number 17 coming up, and its confidence interval.

We propose to model the situation as follows. Let R be the number of times the number 17 comes up in 30 spins of the roulette wheel. We decide to model R ∼ Binomial(30, p).

        Why is this a suitable model?

        What assumptions are being made?

        Are these assumptions reasonable?

        The probability of the observed data is given by

\Pr[\text{obs}|p] = \binom{30}{2} p^2(1-p)^{28}.

        The likelihood is simply the probability of the observed data, but we can ignore the multiplicative constants, so

L(p) \propto p^2(1-p)^{28}.

        The log-likelihood is

        l(p)=2logp+28log(1-p).

        Differentiating,

l'(p) = \frac{2}{p} - \frac{28}{1-p}.

Now remember solutions to l'(p)=0 are potential MLEs:

1. 1

  \frac{2}{\hat p} = \frac{28}{1-\hat p},

2. 2

  2 - 2\hat p = 28\hat p,

3. 3

  \hat p = \frac{2}{30}.

        The second derivative will both tell us whether this is a maximum, and provide the observed information:

l''(p) = -\frac{2}{p^2} - \frac{28}{(1-p)^2}.

This is clearly negative for all p \in (0,1), so \hat p must be a maximum.

        Moreover, the observed information is

I_O(\hat p) = -l''(\hat p) = \frac{2}{\hat p^2} + \frac{28}{(1-\hat p)^2}
= 450 + 32.143
= 482.143.

        A 95% confidence interval for p is given by

\left(\hat p - \frac{1.96}{\sqrt{I_O(\hat p)}},\; \hat p + \frac{1.96}{\sqrt{I_O(\hat p)}}\right),

        which, on substituting in p^ and the observed information becomes

\left(\frac{2}{30} - \frac{1.96}{\sqrt{482.143}},\; \frac{2}{30} + \frac{1.96}{\sqrt{482.143}}\right) = (-0.023, 0.156).

        The resulting confidence interval includes negative values (for a probability parameter). What’s the problem??

        Let’s look at a plot of the log-likelihood for the above situation.

        loglik<-function(p){
        2*log(p) + 28*log(1-p)
        }
        p<-seq(from=0.01,to=0.25,length=1000)
        plot(p,loglik(p),type="l")

        We notice that the log-likelihood is quite asymmetric. This happens because the MLE is close to the edge of the feasible space (i.e. close to 0). The confidence interval defined above is forced to be symmetric, which seems inappropriate here.

        Definition.

        Suppose we have a log-likelihood function with unknown parameter θ, l(θ). Then the deviance function is given by

        D(θ)=2{l(θ^)-l(θ)}.

Notice that D(θ) ≥ 0, and D(θ^)=0.

What can we say about D(θtrue)?

This is a fixed (but unknown) value for fixed data 𝐱=x1,…,xn. However, in a similar spirit to the last lecture, we can consider random data 𝐗=X1,…,Xn. Now, the deviance function depends on 𝐗 (since different data lead to different likelihoods). So, D(θ,𝐗) is a random variable.

        Theorem 2 (Asymptotic distribution of the deviance).

Suppose we have an iid sample 𝐗=X1,…,Xn from some distribution with unknown parameter θ. Then (under certain regularity conditions) in the limit as n → ∞,

D(θ, 𝐗) → χ²₁,

        i.e. the deviance of the true value of θ has a χ2 distribution with one degree of freedom.

        The practical upshot of this result is that we have another way to construct a confidence interval for θ. A 95% confidence interval for θ, for example, is given by {θ:D(θ)<3.84}, i.e. any values of θ whose deviance is smaller than 3.84.

        TheoremExample 15.1.2 An evening at the casino continued

        This property of the deviance is best seen visually. Going back to the roulette data:

        deviance<-function(p){
        2*(loglik(2/30)-loglik(p))
        }
        plot(p,deviance(p),type="l")
        abline(h=3.84)

        From the graph we can estimate the confidence interval based on the deviance. In fact the exact answer to three decimal places is (0.011,0.192). Notice that this is not symmetrical, and that all values in the interval are feasible.
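The endpoints can also be found numerically, for example with uniroot() (a sketch; it reuses the deviance function defined above, and the MLE 2/30):

#solve deviance(p) = 3.84 on either side of the MLE
f<-function(p){deviance(p)-3.84}
lower<-uniroot(f,interval=c(0.001,2/30))$root
upper<-uniroot(f,interval=c(2/30,0.5))$root
c(lower,upper)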

        The original motivation for all of this was that we were wondering if the number 17 comes up more often than with the 1/37 that should be observed in a fair roulette wheel.

        In fact 1/37=0.027, which is within the 95% confidence interval calculated above. Hence there is insufficient evidence (so far) to support the claim that this number is coming up more often than it should.

        Notes (summary)

        • 1

          We have now seen two different ways to calculate approximate confidence intervals (CI) for an unknown parameter. Previously, we calculated CI based on the asymptotic distribution of the MLE (CI-MLE). Here, we showed how to calculate the CI based on the asymptotic distribution of the deviance (CI-D).

        • 2

          We discussed various differences and pros and cons of the two:

          1. 1

            CI-MLE is always symmetric about the MLE. CI-D is not.

          2. 2

            CI-MLE can include values with zero likelihood (e.g. infeasible values such as negative probabilities, as seen here). CI-D will only include feasible values.

          3. 3

            CI-D is typically harder to calculate than CI-MLE.

          4. 4

            For reasons we will not go into here, CI-D is typically more accurate than CI-MLE.

          5. 5

            CI-D is invariant to re-parametrization; CI-MLE is not. (This is a good thing for CI-D, that we will learn more about in subsequent lectures).

        • 3

          Overall, CI-D is usually preferred to CI-MLE (since the only disadvantage is that it is harder to compute).

        • 4

          DEVIANCES ARE ALWAYS NON-NEGATIVE!

        15.2 Re-parametrization and Invariance

        TheoremExample 15.2.1 Accident and Emergency continued

        In our likelihood examples we discussed modelling inter-arrival times at an A&E department using an Exponential distribution. The exponential pdf is given by

        f(x)=λexp(-λx)

for x ≥ 0 and λ > 0, where λ is the rate parameter.

        Based on the inter-arrival times (in minutes):

        18.39,2.70,5.42,0.99,5.42,31.97,2.96,5.28,8.51,10.90,

giving x̄=9.259, we came up with the MLE for λ of λ^ = 1/x̄ = 0.108.

        Now, 𝔼[X]=μ=1/λ.

        How would we go about finding an estimate for μ?

        Method 1: re-write the pdf as

f(x) = \frac{1}{\mu}\exp\left(-\frac{x}{\mu}\right),

where x ≥ 0 and μ > 0, to give a likelihood of

L(\mu) = \prod_{i=1}^{n} \frac{1}{\mu}\exp\left(-\frac{x_i}{\mu}\right),

        then find the MLE by the usual approach.

        Method 2: Since μ=1/λ, presumably μ^=1/λ^=1/0.108=9.259.

        Which method is more convenient?

        Which method appears more rigorous?

In fact, both methods always give the same solution. This property is called invariance of the MLE to reparametrisation. It is a nice property, both because it agrees with our intuition and because it saves us a lot of potential calculation.

        Theorem (Invariance of MLE to reparametrisation.).

        If θ^ is the MLE of θ and ϕ is a monotonic function of θ, ϕ=g(θ), then the MLE of ϕ is ϕ^=g(θ^).

        Proof.

        Write 𝐱=(x1,x2,,xn). The likelihood for θ is L(θ)=f(x|θ), and for ϕ is Lϕ(ϕ). Note that θ=g-1(ϕ) as g is monotonic and define ϕ^ by g(θ^). To show that ϕ^ is the MLE,

L_\phi(\phi) = f(\mathbf{x}|\phi)
= f(\mathbf{x}|g^{-1}(\phi)) \le f(\mathbf{x}|\hat\theta)

        as θ^ is MLE. But

f(\mathbf{x}|\hat\theta) = f(\mathbf{x}|g^{-1}(\hat\phi))
= f(\mathbf{x}|\hat\phi) = L_\phi(\hat\phi).

        This means that both methods above must give the same answer.

        Exercise.

        Show this works for the case above, by demonstrating that Method 1 leads to μ^=x¯=9.259.
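A numerical check, not a substitute for the algebra (our own sketch, using the inter-arrival times above):

#maximise L(mu) numerically and compare with the sample mean
x<-c(18.39,2.70,5.42,0.99,5.42,31.97,2.96,5.28,8.51,10.90)
negloglik<-function(mu){length(x)*log(mu)+sum(x)/mu}
optimize(negloglik,interval=c(0.1,100))$minimum
mean(x)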

        The following corollary follows immediately from invariance of the MLE to reparametrisation.

        Corollary.

        Confidence intervals based on the deviance are invariant to reparametrisation, in the sense that

\{\phi : D(g^{-1}(\phi)) \le 3.84\} = \{\theta : D(\theta) \le 3.84\}.
        Proof.
\{\theta : D(\theta) \le 3.84\} = \{\theta : 2(l(\hat\theta) - l(\theta)) \le 3.84\}
= \{\phi : 2(l(g^{-1}(\hat\phi)) - l(g^{-1}(\phi))) \le 3.84\}

by the Theorem above, which equals

\{\phi : D(g^{-1}(\phi)) \le 3.84\}.

        The practical consequence of this is that if

        (θl,θu) is a deviance confidence interval with coverage p for θ𝑡𝑟𝑢𝑒,

        then

        (g(θl),g(θu)) is a deviance confidence interval with coverage p for ϕ𝑡𝑟𝑢𝑒.

        (Of course, ϕ=g(θ)).

IMPORTANT: This simple translation does not hold for confidence intervals based on the asymptotic distribution of the MLE. This is because that interval depends on the second derivative of l(·) with respect to the parameter, which changes in a more complicated way under different parameter choices.

        This will be explored more in MATH330 Likelihood Inference.

        Exam Question

        1. a

          The random variables X1,X2,,Xn are independent and identically distributed with the geometric distribution

f(x|\theta) = \theta^x(1-\theta), \qquad x = 0, 1, 2, \dots

where \theta is a parameter in the range 0 ≤ θ ≤ 1 to be estimated. The mean of the above geometric distribution is θ/(1-θ).

          1. i

            Write down formulae for the maximum likelihood estimator for θ and for Fisher’s information;

          2. ii

            Write down what you know about the distribution of the maximum likelihood estimator for this example when n is large.

        2. b

In a particular experiment, n=10 and \sum_{i=1}^{n} x_i = 10.

          1. i

            Compute an approximate 95% confidence interval for θ based on the asymptotic distribution of the maximum likelihood estimator;

          2. ii

            Compute the deviance D(θ) and sketch it over the range 0.1θ0.9. Use your sketch to describe how to use the deviance to obtain an approximate 95% confidence interval for θ;

          3. iii

            If you were asked to produce an approximate 95% confidence interval for the mean of the distribution θ/(1-θ), what would be your recommended approach? Justify your answer.

        Solution:

        1. a
          1. i

            For the model, the likelihood function is

L(\theta|\mathbf{X}) = \prod_{i=1}^{n} \theta^{X_i}(1-\theta)
= (1-\theta)^n\,\theta^{\sum X_i}.

            The log-likelihood is then

l(\theta|\mathbf{X}) = n\log(1-\theta) + \sum X_i\log(\theta),

            with derivative

l'(\theta|\mathbf{X}) = -\frac{n}{1-\theta} + \frac{\sum X_i}{\theta}.

A candidate MLE solves l'(\hat\theta)=0, giving

\hat\theta = \frac{\sum X_i}{n + \sum X_i}.

            Moreover,

l''(\theta|\mathbf{X}) = -\frac{n}{(1-\theta)^2} - \frac{\sum X_i}{\theta^2} < 0,

            so this is indeed the MLE.

            For the Fisher Information,

I_E(\theta_{true}) = \mathbb{E}\left[-l''(\theta_{true}|\mathbf{X})\right]
= \mathbb{E}\left[\frac{n}{(1-\theta_{true})^2} + \frac{\sum X_i}{\theta_{true}^2}\right]
= \frac{n}{(1-\theta_{true})^2} + \frac{n}{\theta_{true}^2}\mathbb{E}[X_1]
= \frac{n}{\theta_{true}(1-\theta_{true})^2},

            after simplification, since 𝔼[X1]=θ/(1-θ).

          2. ii

            Using the Fisher information, the asymptotic distribution of the MLE is

\hat\theta(\mathbf{X}) \sim N\left(\theta_{true},\, I_E^{-1}(\theta_{true})\right) \approx N\left(\theta_{true},\, I_O^{-1}(\hat\theta)\right).
        2. b
          1. i

Using the data, the MLE is \hat\theta = \frac{10}{10+10} = \frac{1}{2}. The observed information is

I_O(\hat\theta) = \frac{10}{(1-1/2)^2} + \frac{10}{(1/2)^2} = 80.

            Therefore a 95% confidence interval is

\left(\frac{1}{2} - \frac{1.96}{\sqrt{80}},\; \frac{1}{2} + \frac{1.96}{\sqrt{80}}\right) = (0.281, 0.719).
          2. ii

            The deviance is given by

            D(θ) =2{l(θ^)-l(θ)}
            =2{10log(1/2)+10log(1/2)-10log(1-θ)-10log(θ)}
            =20(-2log2-log(θ(1-θ))).

To plot the deviance, calculate D(0.1) and D(0.9), and note that D(0.5)=D(θ^)=0. A 95% confidence interval is obtained by drawing a horizontal line at 3.84; the interval is all θ with D(θ) ≤ 3.84 (see the R sketch after this solution).

          3. iii

            To construct a confidence interval for the mean, we would use the mean function on the deviance-based confidence interval just calculated, as this is invariant to re-parametrization.
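A short R sketch of part (ii) (our own code, built from the log-likelihood in part (a) with the given data):

#deviance for the geometric exam question
loglik<-function(theta){10*log(1-theta)+10*log(theta)}
dev<-function(theta){2*(loglik(0.5)-loglik(theta))}
theta<-seq(from=0.1,to=0.9,length=1000)
plot(theta,dev(theta),type="l")
abline(h=3.84)
#approximate endpoints of the 95% confidence interval
range(theta[dev(theta)<=3.84])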

        15.3 Summary

        {mdframed}
        • 1

The simple, intuitive answer is true: if θ^ is the MLE of θ and ϕ=g(θ) for any monotonic transformation g, then ϕ^=g(θ^).

        • 2

          The same simple result can be applied to confidence intervals based on the deviance, but can NOT be applied to confidence intervals based on the asymptotic distribution of the MLE.