
7.1 Estimation of regression coefficients β

We shall use the method of least squares to estimate the vector of regression coefficients β. For a linear regression model, this approach to parameter estimation gives the same parameter estimates as the method of maximum likelihood, which will be discussed in weeks 7–10 of this course.

The basic idea of least squares estimation is to find the estimate $\hat{\beta}$ which minimises the sum of squares function

S(\beta) = \sum_{i=1}^{n} (y_i - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p})^2. \qquad (7.1)

We can rewrite the linear regression model in terms of the residuals as

\epsilon_i = Y_i - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p}.

By replacing $Y_i$ with $y_i$, $S(\beta)$ can be interpreted as the sum of squares of the observed residuals. In general, the sum of squares function $S(\beta)$ is a function of $p$ unknown parameters, $\beta_1, \ldots, \beta_p$. To find the parameter values which minimise the function, we calculate all $p$ first-order derivatives, set these derivatives equal to zero and solve simultaneously.

Using definition (7.1), the $j$-th first-order derivative is

\frac{\partial S}{\partial \beta_j} = -2 \sum_{i=1}^{n} x_{i,j} (y_i - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p}). \qquad (7.2)

We could solve the resulting system of $p$ equations by hand, using, for example, substitution. Since this is time-consuming, we instead rewrite the equations in matrix notation. The $j$-th first-order derivative corresponds to the $j$-th element of the vector

-2X^{\top}(y - X\beta).

Thus to find $\hat{\beta}$ we must solve the equation

-2X^{\top}(y - X\hat{\beta}) = 0.

Multiplying out the brackets gives

-2X^{\top}y + 2X^{\top}X\hat{\beta} = 0,

which can be rearranged to

X^{\top}X\hat{\beta} = X^{\top}y.

These are known as the normal equations.

Multiplying both sides on the left by $(X^{\top}X)^{-1}$ gives the least squares estimate of $\beta$,

\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y. \qquad (7.3)

This is one of the most important results of the course!
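As a quick numerical check of this result, the formula $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$ can be applied directly with NumPy and compared against a built-in least squares solver. The data below are made up purely for illustration.

```python
import numpy as np

# Hypothetical small data set: n = 5 observations, p = 2 parameters
# (an intercept column and one explanatory variable).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

In practice one would solve the normal equations (e.g. via `np.linalg.solve` or a QR decomposition) rather than form the inverse explicitly, but the direct formula is fine for illustrating the algebra.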

Remark 1.

In order for the least squares estimate (7.3) to exist, $(X^{\top}X)^{-1}$ must exist. In other words, the $p \times p$ matrix $X^{\top}X$ must be non-singular:

1. $X^{\top}X$ is non-singular iff it has linearly independent columns;

2. this occurs iff $X$ has linearly independent columns;

3. consequently, the explanatory variables must be linearly independent;

4. this relates back to the discussion of factors in Section 6.4. Linear dependence occurs if

   1. an intercept term and the indicator variables for all levels of a factor are included in the model, since the columns representing the indicator variables sum to the column of 1's;

   2. the indicator variables for all levels of two or more factors are included in a model, since for each factor the columns representing its indicator variables sum to the column of 1's.

   Consequently it is safest to include an intercept term and indicator variables for $p-1$ levels of each $p$-level factor.
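To illustrate the point about factors, here is a small made-up design matrix containing an intercept and indicator columns for all three levels of a single factor: the indicator columns sum to the intercept column, so $X^{\top}X$ is singular. Dropping one indicator column restores full rank.

```python
import numpy as np

# Hypothetical design: intercept plus indicator columns for ALL three
# levels of a factor. The three indicator columns sum to the intercept
# column, so X has linearly dependent columns.
X_bad = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [1, 0, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1]], dtype=float)

XtX = X_bad.T @ X_bad
print(np.linalg.matrix_rank(XtX))  # 3, not 4: X^T X is singular

# Dropping the indicator column for one level (the "safest"
# parameterisation described above) restores full column rank.
X_ok = X_bad[:, :3]
print(np.linalg.matrix_rank(X_ok.T @ X_ok))  # 3 = number of columns
```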

Remark 2.

If you want to bypass the summation notation used above entirely, the sum of squares function (7.1) can be written as

S(\beta) = (y - X\beta)^{\top}(y - X\beta) = y^{\top}y - \beta^{\top}X^{\top}y - y^{\top}X\beta + \beta^{\top}X^{\top}X\beta. \qquad (7.4)
1. Now $\beta^{\top}X^{\top}y = (y^{\top}X\beta)^{\top}$, and since both $\beta^{\top}X^{\top}y$ and $y^{\top}X\beta$ are scalars (can you verify this?) we have that $\beta^{\top}X^{\top}y = y^{\top}X\beta$.

2. Hence,

   S(\beta) = y^{\top}y - 2\beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta.

3. Differentiating with respect to $\beta$ gives the vector of first-order derivatives

   -2X^{\top}y + 2X^{\top}X\beta = -2X^{\top}(y - X\beta),

   as before.
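The equality of the summation form (7.1), the matrix form, and the expanded form above can be checked numerically. The design matrix, responses and parameter vector below are arbitrary made-up values used only to confirm the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))   # hypothetical design matrix
y = rng.normal(size=n)        # hypothetical responses
beta = rng.normal(size=p)     # an arbitrary parameter vector

# Summation form (7.1): sum of squared residuals.
S_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(n))

# Matrix form: (y - X beta)^T (y - X beta).
r = y - X @ beta
S_mat = r @ r

# Expanded form: y^T y - 2 beta^T X^T y + beta^T X^T X beta.
S_exp = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X @ beta)

print(np.allclose(S_sum, S_mat) and np.allclose(S_mat, S_exp))  # True
```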

Remark 3.

To prove that $\hat{\beta}$ minimises the sum of squares function, we must check that the matrix of second derivatives is positive definite at $\hat{\beta}$.

1. This is the multi-dimensional analogue of checking that the second derivative is positive at the minimum of a function of one unknown.

2. Returning once more to summation notation,

   \frac{\partial^2 S}{\partial \beta_k \, \partial \beta_j} = 2 \sum_{i=1}^{n} x_{i,j} x_{i,k}.

3. This is twice the $(j,k)$-th element of the matrix $X^{\top}X$, so the matrix of second derivatives of $S(\beta)$ is $2X^{\top}X$; it is positive definite iff $X^{\top}X$ is.

4. To prove that $X^{\top}X$ is positive definite, we must show that $z^{\top}X^{\top}Xz > 0$ for all non-zero vectors $z$.

5. Writing $z^{\top}X^{\top}Xz = (Xz)^{\top}(Xz) = \|Xz\|^2$, we see the quadratic form is always non-negative, and it is zero only when $Xz = 0$. Since $X$ has linearly independent columns, $Xz \neq 0$ for every $z \neq 0$, so $z^{\top}X^{\top}Xz > 0$ and the result follows.
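The positive definiteness argument can also be checked numerically: for a full-column-rank design matrix (made up below for illustration), the quadratic form $z^{\top}X^{\top}Xz$ equals $\|Xz\|^2$ and is strictly positive, and all eigenvalues of $X^{\top}X$ are strictly positive.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))   # hypothetical full-column-rank design

XtX = X.T @ X
z = rng.normal(size=3)        # an arbitrary non-zero vector

# z^T X^T X z = (Xz)^T (Xz) = ||Xz||^2 > 0 when the columns of X
# are linearly independent and z is non-zero.
quad = z @ XtX @ z
print(quad > 0)                               # True
print(np.allclose(quad, (X @ z) @ (X @ z)))   # True

# Equivalently, a symmetric matrix is positive definite iff all of
# its eigenvalues are strictly positive.
print(np.all(np.linalg.eigvalsh(XtX) > 0))    # True
```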