We shall use the method of least squares to estimate the vector of regression coefficients $\beta$. For a linear regression model, this approach to parameter estimation gives the same parameter estimates as the method of maximum likelihood, which will be discussed in weeks 7–10 of this course.
The basic idea of least squares estimation is to find the estimate $\hat{\beta}$ which minimises the sum of squares function
$$S(\beta) = \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2, \qquad (7.1)$$
where $y_i$ is the $i$-th response and $x_{ij}$ is the $(i,j)$-th element of the design matrix $X$.
We can rewrite the linear regression model in terms of the residuals as
$$\epsilon_i = y_i - \sum_{j=1}^{p} x_{ij}\beta_j, \qquad i = 1, \ldots, n.$$
By replacing $\beta$ with its estimate $\hat{\beta}$, $S(\hat{\beta})$ can be interpreted as the sum of squares of the observed residuals. In general, the sum of squares function is a function of $p$ unknown parameters, $\beta_1, \ldots, \beta_p$. To find the parameter values which minimise the function, we calculate all $p$ first-order derivatives, set these derivatives equal to zero and solve simultaneously.
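As a concrete illustration, here is a minimal sketch that evaluates the sum of squares function (7.1) directly from its definition. The data set and the candidate value of $\beta$ are invented purely for illustration.

```python
import numpy as np

# Invented data set: n = 5 observations, p = 2 columns in the design
# matrix (a column of 1's for the intercept and one covariate).
X = np.array([[1.0, 0.5],
              [1.0, 1.3],
              [1.0, 2.1],
              [1.0, 2.9],
              [1.0, 3.8]])
y = np.array([1.1, 2.4, 3.1, 4.2, 4.8])

def sum_of_squares(beta, X, y):
    """Evaluate S(beta) = sum_i (y_i - sum_j x_ij * beta_j)^2, as in (7.1)."""
    residuals = y - X @ beta     # y_i minus the fitted value for each i
    return np.sum(residuals ** 2)

# An arbitrary candidate value of beta; the least squares estimate will
# make S(beta) at least as small as this.
beta_guess = np.array([0.5, 1.0])
print(sum_of_squares(beta_guess, X, y))
```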
Using definition (7.1), the $j$-th first-order derivative is
$$\frac{\partial S}{\partial \beta_j} = -2\sum_{i=1}^{n} x_{ij}\left(y_i - \sum_{k=1}^{p} x_{ik}\beta_k\right). \qquad (7.2)$$
We could solve the resulting system of $p$ equations by hand, using e.g. substitution. Since this is time-consuming, we instead rewrite our equations using matrix notation. The $j$-th first-order derivative (7.2) corresponds to the $j$-th element of the vector
$$\frac{\partial S}{\partial \beta} = -2X^T(y - X\beta).$$
Thus to find $\hat{\beta}$ we must solve the equation
$$X^T(y - X\hat{\beta}) = 0.$$
Multiplying out the brackets gives
$$X^T y - X^T X\hat{\beta} = 0,$$
which can be rearranged to
$$X^T X\hat{\beta} = X^T y.$$
Multiplying both sides by $(X^T X)^{-1}$ gives the least squares estimates for $\beta$,
$$\hat{\beta} = (X^T X)^{-1} X^T y. \qquad (7.3)$$
This is one of the most important results of the course!
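For readers who like to see the formula in action, the sketch below (using the same invented data as before) applies (7.3) numerically and checks the answer against NumPy's built-in least squares routine.

```python
import numpy as np

# Invented data set, as in the earlier sketch.
X = np.array([[1.0, 0.5],
              [1.0, 1.3],
              [1.0, 2.1],
              [1.0, 2.9],
              [1.0, 3.8]])
y = np.array([1.1, 2.4, 3.1, 4.2, 4.8])

# Least squares estimate (7.3): beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is used rather than forming the inverse explicitly,
# which is the numerically preferred way to apply this formula.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))  # True: the two agree
```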
In order for the least squares estimate (7.3) to exist, $(X^T X)^{-1}$ must exist. In other words, the matrix $X^T X$ must be non-singular:
$X^T X$ is non-singular iff it has linearly independent columns;
This occurs iff $X$ has linearly independent columns;
Consequently, the explanatory variables must be linearly independent.
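A quick way to check these conditions numerically is to compare the rank of $X$ (or of $X^T X$) with its number of columns; the sketch below does this for an invented design matrix.

```python
import numpy as np

# Invented design matrix: intercept plus two linearly independent covariates.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.3, 1.7],
              [1.0, 2.1, 0.4],
              [1.0, 2.9, 3.2],
              [1.0, 3.8, 1.1]])

p = X.shape[1]
print(np.linalg.matrix_rank(X) == p)        # True: columns of X are linearly independent
print(np.linalg.matrix_rank(X.T @ X) == p)  # True: X^T X is non-singular
```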
This relates back to the discussion on factors in Section 6.4. Linear dependence occurs if
An intercept term and the indicator variables for all levels of a factor are included in the model, since the columns representing the indicator variables sum to the column of 1’s.
The indicator variables for all levels of two or more factors are included in a model, since the columns representing the indicator variables sum to the column of 1’s for each factor.
Consequently it is safest to include an intercept term and indicator variables for $K-1$ levels of each $K$-level factor.
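The following sketch, built around an invented three-level factor, shows the linear dependence described in the first case above: with an intercept and indicator columns for all three levels the design matrix loses full column rank (so $X^T X$ is singular), whereas dropping one level restores it.

```python
import numpy as np

# Invented factor with three levels, observed on six units.
levels = np.array([0, 0, 1, 1, 2, 2])
indicators = np.eye(3)[levels]          # one indicator column per level
intercept = np.ones((6, 1))

# Intercept + indicators for ALL levels: the indicator columns sum to the
# column of 1's, so the columns are linearly dependent.
X_bad = np.hstack([intercept, indicators])
print(np.linalg.matrix_rank(X_bad))     # 3, not 4: X^T X would be singular

# Intercept + indicators for K - 1 = 2 of the levels: full column rank.
X_ok = np.hstack([intercept, indicators[:, 1:]])
print(np.linalg.matrix_rank(X_ok))      # 3 = number of columns
```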
If you want to bypass the summation notation used above completely, the sum of squares function (7.1) can be written in matrix form as
$$S(\beta) = (y - X\beta)^T(y - X\beta). \qquad (7.4)$$
Now
$$(y - X\beta)^T(y - X\beta) = y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta,$$
and since both $y^T X\beta$ and $\beta^T X^T y$ are scalars (can you verify this?) we have that $\beta^T X^T y = (\beta^T X^T y)^T = y^T X\beta$.
Hence,
$$S(\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X\beta.$$
Differentiating with respect to $\beta$ gives the vector of first-order derivatives
$$\frac{\partial S}{\partial \beta} = -2X^T y + 2X^T X\beta = -2X^T(y - X\beta),$$
as before.
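If you would like to verify the matrix algebra numerically, the sketch below (with invented data) checks that the matrix form of $S(\beta)$ agrees with the summation form (7.1), and that the analytic gradient $-2X^T(y - X\beta)$ matches a simple finite-difference approximation.

```python
import numpy as np

X = np.array([[1.0, 0.5],
              [1.0, 1.3],
              [1.0, 2.1],
              [1.0, 2.9],
              [1.0, 3.8]])
y = np.array([1.1, 2.4, 3.1, 4.2, 4.8])
beta = np.array([0.5, 1.0])   # arbitrary point at which to check the algebra

# Summation form (7.1) and matrix form of S(beta) agree.
S_sum = np.sum((y - X @ beta) ** 2)
S_matrix = (y - X @ beta) @ (y - X @ beta)
print(np.isclose(S_sum, S_matrix))               # True

# Analytic gradient -2 X^T (y - X beta) versus central finite differences.
grad_analytic = -2 * X.T @ (y - X @ beta)
eps = 1e-6
grad_numeric = np.array([
    (np.sum((y - X @ (beta + eps * e)) ** 2)
     - np.sum((y - X @ (beta - eps * e)) ** 2)) / (2 * eps)
    for e in np.eye(len(beta))
])
print(np.allclose(grad_analytic, grad_numeric))  # True
```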
To prove that $\hat{\beta}$ minimises the sum of squares function we must check that the matrix of second derivatives is positive definite at $\beta = \hat{\beta}$.
This is the multi-dimensional analogue to checking that the second derivative is positive at the minimum of a function in one unknown.
Returning once more to summation notation,
$$\frac{\partial^2 S}{\partial \beta_j \, \partial \beta_k} = 2\sum_{i=1}^{n} x_{ij} x_{ik}.$$
This is the $(j,k)$-th element of the matrix $2X^T X$. Thus the second derivative of $S(\beta)$ is $2X^T X$.
To prove that $2X^T X$ is positive definite, we must show that $a^T(2X^T X)a > 0$ for all non-zero vectors $a$.
Since $a^T X^T X a$ can be written as the product of a vector and its transpose, $a^T X^T X a = (Xa)^T(Xa) \ge 0$, the result follows immediately: $(Xa)^T(Xa)$ is a sum of squares, and it is strictly positive for every non-zero $a$ because the columns of $X$ are linearly independent, so $Xa \neq 0$.
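As a final numerical sanity check (again with invented data), one can confirm that $2X^T X$ is positive definite by inspecting its eigenvalues, or by checking directly that $a^T(2X^T X)a = 2(Xa)^T(Xa) > 0$ for a few non-zero vectors $a$.

```python
import numpy as np

X = np.array([[1.0, 0.5],
              [1.0, 1.3],
              [1.0, 2.1],
              [1.0, 2.9],
              [1.0, 3.8]])

A = 2 * X.T @ X

# A symmetric matrix is positive definite iff all its eigenvalues are positive.
print(np.all(np.linalg.eigvalsh(A) > 0))   # True

# Direct check of the quadratic form for a few random non-zero vectors a:
# a^T (2 X^T X) a = 2 (X a)^T (X a), a sum of squares.
rng = np.random.default_rng(0)
for _ in range(5):
    a = rng.normal(size=2)
    print(a @ A @ a > 0)                   # True each time
```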