6 Linear predictor and model formula

6.3 Factors for categorical variables

Linear models explain the values of the dependent variable by means of a linear combination of the explanatory variables, the linear predictor. With numerical explanatory variables, such a linear relationship is easy to derive and understand. But how should categorical variables be handled?

Consider the example of investigating the relationship of weight to height among a population of school children and its dependence on the gender of the child. Gender is unlike height as it is not measured numerically. However, we can relate the child’s gender to a Boolean TRUE/FALSE variable by asking the question: Is the child male?

We can represent a Boolean variable numerically as 1 if TRUE or 0 if FALSE. Such numerical representations are called indicator variables. For the school example, let $g_i$ be the gender indicator variable for the $i$th child, where $g_i = 1$ if the child is male and $g_i = 0$ if the child is female. Combining this with the height explanatory variable (denoted by $h_i$) gives the linear predictor for the expected weight of child $i$:

$$\eta_i = \beta_0 + \beta_1 h_i + \beta_2 g_i$$
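
As a minimal sketch of how this linear predictor is evaluated, the following Python snippet turns the answer to "Is the child male?" into an indicator and computes $\eta_i$. The coefficient values, the heights and the function name `linear_predictor` are made up purely for illustration; they are not estimates from any data.

```python
# Hypothetical coefficients (illustrative values only, not fitted to data).
beta0, beta1, beta2 = -20.0, 0.4, 2.0


def linear_predictor(height_cm, is_male):
    """Return eta = beta0 + beta1 * h + beta2 * g for one child."""
    g = 1 if is_male else 0  # indicator variable: 1 if TRUE, 0 if FALSE
    return beta0 + beta1 * height_cm + beta2 * g


# Made-up children: (height in cm, is the child male?)
children = [(140.0, True), (140.0, False), (152.0, False)]
etas = [linear_predictor(h, male) for h, male in children]
print(etas)
```

Changing the values in `children` or the coefficients is a quick way to explore how each term contributes to the predictor.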

 
Exercise 6.48
Write the linear predictor for a male and a female child. Explain the effect $\beta_2$ has on the linear predictor.

 

Definition 6.3.1.

A qualitative variable that takes a finite number of non-numerical values is categorical. The values are sometimes called levels. The subspace spanned by the indicator vectors for the levels is a factor. (Sometimes, though loosely, we take factor to refer directly to the variable.)

Returning to the school example, the categorical variable gender has two levels, either male or female.

Categorical variables may have more than two levels, such as blood type, where a patient has one of the levels {O, A, B, AB}. The factor for this variable is defined by four indicator variables, e.g.:

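$$b_{ij} = \begin{cases} 1 & \text{if patient } i \text{ has blood group } j, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{for } j \in \{\mathrm{O}, \mathrm{A}, \mathrm{B}, \mathrm{AB}\}.$$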
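
The definition of $b_{ij}$ can be sketched in a few lines of Python. The patient data and the names `LEVELS` and `indicators` are assumptions made for this illustration only.

```python
LEVELS = ["O", "A", "B", "AB"]  # levels of the categorical variable


def indicators(blood_group):
    """Return {level j: b_ij}, with b_ij = 1 for the patient's group, else 0."""
    if blood_group not in LEVELS:
        raise ValueError(f"unknown blood group: {blood_group!r}")
    return {level: int(level == blood_group) for level in LEVELS}


patients = ["A", "O", "AB", "A"]  # made-up data, one blood group per patient
rows = [indicators(bg) for bg in patients]
for bg, row in zip(patients, rows):
    print(bg, row)
```

Each patient has exactly one indicator equal to 1, since a patient belongs to exactly one blood group.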

The distinction between factor and numerical variables is usually straightforward, but some variables fall in between. For instance, a survey on household bills may ask "How many hours a day do you use a mobile phone?" with options ‘0hr-1hr’, ‘1hr-2hr’, ‘2hr-3hr’ or ‘3hr+’. To add to the confusion, the data could be encoded in the database as 1, 2, 3 and 4 for the four levels respectively.
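
To make the two possible treatments concrete, here is a small Python sketch that stores the same survey answers once as the database codes 1 to 4 and once as indicator variables. The responses and the variable names are invented for illustration.

```python
CATEGORIES = ["0hr-1hr", "1hr-2hr", "2hr-3hr", "3hr+"]

responses = ["1hr-2hr", "3hr+", "0hr-1hr", "1hr-2hr"]  # made-up survey answers

# Encoding 1: a single numerical column holding the database codes 1..4.
numeric_codes = [CATEGORIES.index(r) + 1 for r in responses]

# Encoding 2: one indicator column per level (a factor with four levels).
indicator_rows = [[int(r == c) for c in CATEGORIES] for r in responses]

print(numeric_codes)
for row in indicator_rows:
    print(row)
```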

 
Exercise 6.49
What assumptions are being made about the relationship between phone usage categories if the variable is treated as numerical?

 

6.3.1 Linear predictor with categorical variables

Suppose we have three categorical variables: A with four levels, and B and C with two levels each. Measurements from $n = 6$ units were collected as follows:

unit   A    B    C
1      A2   B2   C1
2      A1   B2   C2
3      A1   B2   C1
4      A2   B1   C2
5      A4   B1   C1
6      A3   B1   C2

Each unit takes one and only one level of each factor. This information can be converted into indicator vectors:

unit   a1   a2   a3   a4   b1   b2   c1   c2
1      0    1    0    0    0    1    1    0
2      1    0    0    0    0    1    0    1
3      1    0    0    0    0    1    1    0
4      0    1    0    0    1    0    0    1
5      0    0    0    1    1    0    1    0
6      0    0    1    0    1    0    0    1

Here, $\mathbf{a}_1$ indicates level A1 of factor A and occurs for units 2 and 3. Also, $\mathbf{c}_2$ indicates level C2 of factor C and occurs for units 2, 4 and 6. It follows that the span for each categorical variable is:

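$$\mathcal{A} = \operatorname{span}(\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3, \mathbf{a}_4), \quad \mathcal{B} = \operatorname{span}(\mathbf{b}_1, \mathbf{b}_2), \quad \mathcal{C} = \operatorname{span}(\mathbf{c}_1, \mathbf{c}_2).$$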

This procedure generalizes to an arbitrary number of factors with arbitrary numbers of levels in the obvious way.
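
The conversion from levels to indicator vectors can be sketched for any number of factors and levels as follows. This is only one possible way to write it; the names `data` and `indicator_columns` are chosen for this illustration, and the sketch creates a column only for levels that actually occur in the data.

```python
# The six units of the example: one level per categorical variable.
data = {
    "A": ["A2", "A1", "A1", "A2", "A4", "A3"],
    "B": ["B2", "B2", "B2", "B1", "B1", "B1"],
    "C": ["C1", "C2", "C1", "C2", "C1", "C2"],
}


def indicator_columns(values):
    """Map each observed level to its 0/1 indicator vector over the units."""
    return {
        level: [int(v == level) for v in values]
        for level in sorted(set(values))
    }


indicators = {name: indicator_columns(values) for name, values in data.items()}
for name, cols in indicators.items():
    for level, col in cols.items():
        print(level, col)
```

For instance, `indicators["A"]["A1"]` has a 1 in positions 2 and 3 only, matching the units that take level A1.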

However, since each unit must fall in exactly one level for each categorical variable, we note the relationships:

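$$\mathbf{a}_1 + \mathbf{a}_2 + \mathbf{a}_3 + \mathbf{a}_4 = \mathbf{1}, \quad \mathbf{b}_1 + \mathbf{b}_2 = \mathbf{1}, \quad \mathbf{c}_1 + \mathbf{c}_2 = \mathbf{1}.$$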

Evidently, the columns in the above table are linearly dependent, e.g. $\mathbf{b}_1 + \mathbf{b}_2 = \mathbf{c}_1 + \mathbf{c}_2$. This is an undesirable property for a design matrix: when its columns are linearly dependent, the coefficients are not uniquely determined.
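
These relationships are easy to check numerically. The sketch below adds up the indicator entries of each categorical variable for every unit of the example; the names `indicator_table` and `block_sum` are assumptions made for this illustration.

```python
# Rows are units 1..6; columns are a1 a2 a3 a4 b1 b2 c1 c2.
indicator_table = [
    [0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
]

# Column positions belonging to each categorical variable.
A_COLS, B_COLS, C_COLS = range(0, 4), range(4, 6), range(6, 8)


def block_sum(row, cols):
    """Sum the indicator entries of one categorical variable for one unit."""
    return sum(row[j] for j in cols)


sums = [
    (block_sum(row, A_COLS), block_sum(row, B_COLS), block_sum(row, C_COLS))
    for row in indicator_table
]
print(sums)
```

Every triple equals `(1, 1, 1)`: within each categorical variable the indicator columns add up to the all-ones vector, so the sums for A, B and C coincide, which is exactly the linear dependence noted above.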

From the above relationships, we can write one of the indicator vectors in terms of the others for each categorical variable. For example, $\mathbf{a}_1 = \mathbf{1} - \mathbf{a}_2 - \mathbf{a}_3 - \mathbf{a}_4$. This gives an alternative spanning set for each categorical variable:

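$$\mathcal{A} = \operatorname{span}(\mathbf{1}, \mathbf{a}_2, \mathbf{a}_3, \mathbf{a}_4), \quad \mathcal{B} = \operatorname{span}(\mathbf{1}, \mathbf{b}_2), \quad \mathcal{C} = \operatorname{span}(\mathbf{1}, \mathbf{c}_2).$$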

The linear predictor $\boldsymbol{\eta}$ is therefore an element of the sum of these subspaces:

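$$\boldsymbol{\eta} \in \mathcal{A} + \mathcal{B} + \mathcal{C} = \operatorname{span}(\mathbf{1}, \mathbf{a}_2, \mathbf{a}_3, \mathbf{a}_4, \mathbf{b}_2, \mathbf{c}_2).$$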

These six vectors are linearly independent and so provide a convenient basis for defining the design matrix. For the example above, the design matrix is:

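$$X = \begin{bmatrix} \mathbf{1} & \mathbf{a}_2 & \mathbf{a}_3 & \mathbf{a}_4 & \mathbf{b}_2 & \mathbf{c}_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}$$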
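
As a rough sketch, this design matrix can be assembled from the raw levels by adding a column of ones and dropping the indicator of the first level of each categorical variable (the dropped level is often called the reference level). The function names `design_matrix` and `rank` are assumptions for this illustration; the rank is computed by exact Gaussian elimination only to keep the example free of dependencies, and a numerical library routine would normally be used instead.

```python
from fractions import Fraction

data = {
    "A": ["A2", "A1", "A1", "A2", "A4", "A3"],
    "B": ["B2", "B2", "B2", "B1", "B1", "B1"],
    "C": ["C1", "C2", "C1", "C2", "C1", "C2"],
}


def design_matrix(data):
    """Column of ones plus indicators for every level except the first."""
    n = len(next(iter(data.values())))
    names, columns = ["1"], [[1] * n]
    for values in data.values():
        for level in sorted(set(values))[1:]:  # drop the first level
            names.append(level)
            columns.append([int(v == level) for v in values])
    return names, [list(row) for row in zip(*columns)]


def rank(matrix):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


names, X = design_matrix(data)
print(names)
for row in X:
    print(row)
print("rank:", rank(X), "columns:", len(names))
```

The rank equals the number of columns, which confirms for this example that the six columns of the design matrix are linearly independent.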

 
Exercise 6.50
The linear predictor corresponding to the design matrix X is

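$$\boldsymbol{\eta} = \beta_0 \mathbf{1} + \beta_1 \mathbf{a}_2 + \beta_2 \mathbf{a}_3 + \beta_3 \mathbf{a}_4 + \beta_4 \mathbf{b}_2 + \beta_5 \mathbf{c}_2.$$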

Explain the meaning of each coefficient.