
6.3 Factors for categorical variables

Linear models explain the values of the dependent variable by means of a linear combination of the explanatory variables, the linear predictor. A linear relationship with numerical explanatory variables is easy to derive and understand. But how should categorical variables be handled?

Consider the example of investigating the relationship of weight to height among a population of school children and its dependence on the gender of the child. Gender is unlike height as it is not measured numerically. However, we can relate the child’s gender to a Boolean TRUE/FALSE variable by asking the question: Is the child male?

We can represent Boolean variables numerically as $1$ if TRUE or $0$ if FALSE. Variables with this numerical representation are called indicator variables. For the school example, let $g_{i}$ be the gender indicator variable for the $i$th child, where $g_{i}=1$ if the child is male and $g_{i}=0$ if the child is female. Combining this with the height explanatory variable (denoted by $h_{i}$) gives the linear predictor for the expected weight of child $i$:

\eta_{i}=\beta_{0}+\beta_{1}h_{i}+\beta_{2}g_{i}
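To make the role of the indicator concrete, here is a minimal Python sketch, assuming NumPy is available. The heights, genders and co-efficient values are hypothetical and chosen only for illustration; they do not come from any data set in these notes.

```python
import numpy as np

# Hypothetical data for four children (not from the text).
height_cm = np.array([120.0, 135.0, 128.0, 142.0])
is_male = np.array([True, False, False, True])

# Indicator variable g_i: 1 if the child is male, 0 if female.
g = is_male.astype(float)

# Design matrix with columns [1, h_i, g_i].
X = np.column_stack([np.ones_like(height_cm), height_cm, g])

# Hypothetical co-efficients (beta_0, beta_1, beta_2), for illustration only.
beta = np.array([-20.0, 0.4, 1.5])

# Linear predictor eta_i = beta_0 + beta_1 * h_i + beta_2 * g_i for each child.
eta = X @ beta
print(eta)
```

Encoding gender as a numerical column means the same matrix product handles the numerical and the categorical variable together.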

Exercise 6.48
Write the linear predictor for a male and a female child. Explain the effect $\beta_{2}$ has on the linear predictor.

Definition 6.3.1.

A qualitative variable that takes a finite number of non-numerical values is categorical. The values are sometimes called levels. The subspace spanned by the indicator vectors for the levels is a factor. (Sometimes, though loosely, we take factor to refer directly to the variable.)

Returning to the school example, the categorical variable gender has two levels, either male or female.

Categorical variables may have more than two levels, such as blood type, where a patient has exactly one of the levels $\{O,A,B,AB\}$. The factor for this variable is defined by four indicator variables, e.g.:

b_{i}^{j}=\begin{cases}1&\text{if patient $i$ has blood group $j$,}\\ 0&\text{otherwise,}\end{cases}\quad\text{for}~j\in\{O,A,B,AB\}.
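The indicator variables $b_{i}^{j}$ can be computed directly from a list of recorded blood groups. The following is a small Python sketch, assuming NumPy; the five patients and their blood groups are hypothetical.

```python
import numpy as np

levels = ["O", "A", "B", "AB"]

# Hypothetical blood groups for five patients (illustration only).
blood = ["A", "O", "AB", "O", "B"]

# b[j][i] = 1 if patient i has blood group j, and 0 otherwise.
b = {j: np.array([1 if x == j else 0 for x in blood]) for j in levels}

for j in levels:
    print(j, b[j])
```

Each patient has a $1$ in exactly one of the four indicator vectors, because each patient has exactly one blood group.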

The distinction between factor and numerical variables is usually straightforward, but some variables fall in between. For instance, a survey on household bills may ask ‘How many hours a day do you use a mobile phone?’ with the options ‘0hr-1hr’, ‘1hr-2hr’, ‘2hr-3hr’ or ‘3hr+’. To add to the confusion, the data could be encoded in the database as $1$, $2$, $3$ and $4$ for the four levels respectively.

Exercise 6.49
What assumptions are being made about the relationship between the phone usage categories if the variable is treated as numerical?

6.3.1 Linear predictor with categorical variables

Suppose we have three categorical variables: $A$ with four levels, $B$ with two levels and $C$ also with two levels. Measurements from $n=6$ units were collected as follows:

\begin{array}{cccc}
\mbox{unit}&A&B&C\\ \hline
1&A_{2}&B_{2}&C_{1}\\
2&A_{1}&B_{2}&C_{2}\\
3&A_{1}&B_{2}&C_{1}\\
4&A_{2}&B_{1}&C_{2}\\
5&A_{4}&B_{1}&C_{1}\\
6&A_{3}&B_{1}&C_{2}\\
\end{array}

Each unit takes one and only one level of each factor. This information can be converted into indicator vectors:

\begin{array}{ccccccccc}
\mbox{unit}&\mathbf{a}_{1}&\mathbf{a}_{2}&\mathbf{a}_{3}&\mathbf{a}_{4}&\mathbf{b}_{1}&\mathbf{b}_{2}&\mathbf{c}_{1}&\mathbf{c}_{2}\\
1&0&1&0&0&0&1&1&0\\
2&1&0&0&0&0&1&0&1\\
3&1&0&0&0&0&1&1&0\\
4&0&1&0&0&1&0&0&1\\
5&0&0&0&1&1&0&1&0\\
6&0&0&1&0&1&0&0&1\\
\end{array}

Here, $\mathbf{a}_{1}$ indicates level $A_{1}$ of factor $A$ and equals $1$ for units 2 and 3. Similarly, $\mathbf{c}_{2}$ indicates level $C_{2}$ of factor $C$ and equals $1$ for units 2, 4 and 6. It follows that the span for each categorical variable is:

\mathcal{A}=\mathrm{span}(\mathbf{a}_{1},\mathbf{a}_{2},\mathbf{a}_{3},\mathbf{a}_{4}),\quad\mathcal{B}=\mathrm{span}(\mathbf{b}_{1},\mathbf{b}_{2}),\quad\mathcal{C}=\mathrm{span}(\mathbf{c}_{1},\mathbf{c}_{2}).

This procedure generalizes to an arbitrary number of factors with arbitrary numbers of levels in the obvious way.
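As an illustration of the general procedure, the following Python sketch (assuming NumPy) builds the indicator vectors for any number of factors. The helper name `indicator_matrix` and the string labels such as `"A2"` are our own choices for this sketch; the data are those of the six units above.

```python
import numpy as np

# Levels observed for units 1..6, as in the table of measurements.
data = {
    "A": ["A2", "A1", "A1", "A2", "A4", "A3"],
    "B": ["B2", "B2", "B2", "B1", "B1", "B1"],
    "C": ["C1", "C2", "C1", "C2", "C1", "C2"],
}

# All levels of each factor, in the order used for the columns.
levels = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2"],
}


def indicator_matrix(values, level_names):
    """Return an n x k matrix of 0s and 1s with one column per level."""
    values = np.asarray(values)
    return np.column_stack([(values == lev).astype(int) for lev in level_names])


# One block of indicator columns per factor, then all blocks side by side.
indicators = {name: indicator_matrix(data[name], levels[name]) for name in data}
full = np.hstack([indicators[name] for name in ("A", "B", "C")])
print(full)
```

The columns of `full` are ordered as $\mathbf{a}_{1},\ldots,\mathbf{a}_{4},\mathbf{b}_{1},\mathbf{b}_{2},\mathbf{c}_{1},\mathbf{c}_{2}$, so the printed matrix can be compared entry by entry with the table of indicators.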

However, since each unit must fall in exactly one level for each categorical variable, we note the relationships:

\mathbf{a}_{1}+\mathbf{a}_{2}+\mathbf{a}_{3}+\mathbf{a}_{4}=\mathbf{1},\quad\mathbf{b}_{1}+\mathbf{b}_{2}=\mathbf{1},\quad\mathbf{c}_{1}+\mathbf{c}_{2}=\mathbf{1}.

Evidently, the columns in the above table are linearly dependent, e.g. $\mathbf{b}_{1}+\mathbf{b}_{2}=\mathbf{c}_{1}+\mathbf{c}_{2}$. This is not a desirable property for a design matrix: different choices of co-efficients then give the same linear predictor, so the co-efficients are not uniquely determined.
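The dependence can be checked numerically. The sketch below, assuming NumPy, copies the eight indicator columns from the table and compares the number of columns with the rank computed by `numpy.linalg.matrix_rank`; a rank smaller than the number of columns confirms that the columns are linearly dependent.

```python
import numpy as np

# Columns a1..a4, b1, b2, c1, c2 for units 1..6, copied from the indicator table.
Z = np.array([
    [0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
])

# Each block of indicator columns sums to the vector of ones.
a_sum = Z[:, 0:4].sum(axis=1)
b_sum = Z[:, 4:6].sum(axis=1)
c_sum = Z[:, 6:8].sum(axis=1)
print(a_sum, b_sum, c_sum)

# Compare the number of columns with the rank of the matrix.
n_columns = Z.shape[1]
rank = np.linalg.matrix_rank(Z)
print(n_columns, rank)
```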

From the above relationships, we are able to write one of the indicator vectors in terms of $\mathbf{1}$ and the others for each of the categorical variables. For example, $\mathbf{a}_{1}=\mathbf{1}-\mathbf{a}_{2}-\mathbf{a}_{3}-\mathbf{a}_{4}$. This means we are able to provide an alternative spanning set for each categorical variable:

\mathcal{A}=\mathrm{span}(\mathbf{1},\mathbf{a}_{2},\mathbf{a}_{3},\mathbf{a}_{4}),\quad\mathcal{B}=\mathrm{span}(\mathbf{1},\mathbf{b}_{2}),\quad\mathcal{C}=\mathrm{span}(\mathbf{1},\mathbf{c}_{2}).

The linear predictor $\boldsymbol{\eta}$ is therefore an element of the sum of these subspaces:

\boldsymbol{\eta}\in\mathcal{A}+\mathcal{B}+\mathcal{C}=\mathrm{span}(\mathbf{1},\mathbf{a}_{2},\mathbf{a}_{3},\mathbf{a}_{4},\mathbf{b}_{2},\mathbf{c}_{2}).

These indicator vectors are linearly independent and so provide an ideal basis for defining the design matrix. For the example above, the design matrix is:

X=\left[\begin{array}{cccccc}
\mathbf{1}&\mathbf{a}_{2}&\mathbf{a}_{3}&\mathbf{a}_{4}&\mathbf{b}_{2}&\mathbf{c}_{2}\\ \hline
1&1&0&0&1&0\\
1&0&0&0&1&1\\
1&0&0&0&1&0\\
1&1&0&0&0&1\\
1&0&0&1&0&0\\
1&0&1&0&0&1\\
\end{array}\right]
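A design matrix of this form can be assembled by taking a column of ones and, for each factor, the indicators of every level except the first. The following Python sketch, assuming NumPy, does this for the six units of the example and checks the rank; the variable names and string labels are our own and purely illustrative.

```python
import numpy as np

# Levels observed for units 1..6, as in the table of measurements.
data = {
    "A": ["A2", "A1", "A1", "A2", "A4", "A3"],
    "B": ["B2", "B2", "B2", "B1", "B1", "B1"],
    "C": ["C1", "C2", "C1", "C2", "C1", "C2"],
}
levels = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2"],
}

n = 6
columns = [np.ones(n, dtype=int)]  # the vector of ones
names = ["1"]
for factor in ("A", "B", "C"):
    values = np.asarray(data[factor])
    # Skip the first level: its indicator is recoverable from the ones
    # vector and the remaining indicators of the same factor.
    for lev in levels[factor][1:]:
        columns.append((values == lev).astype(int))
        names.append(lev)

X = np.column_stack(columns)
rank = np.linalg.matrix_rank(X)
print(names)
print(X)
print(X.shape[1], rank)  # number of columns and rank
```

When the rank equals the number of columns, the columns of $X$ are linearly independent.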

Exercise 6.50
The linear predictor corresponding to the design matrix $X$ is

\boldsymbol{\eta}=\beta_{0}\mathbf{1}+\beta_{1}\mathbf{a}_{2}+\beta_{2}\mathbf{a}_{3}+\beta_{3}\mathbf{a}_{4}+\beta_{4}\mathbf{b}_{2}+\beta_{5}\mathbf{c}_{2}.

Explain the meaning of each co-efficient.