1 Introduction

1.4 What is a GLM?

We have discussed several statistical models that share a common structure for describing the relationship between a single random variable, the response, and one or more explanatory variables, the covariates, which may or may not themselves be random. These models are collectively known as generalized linear models, or GLMs.

Generalized linear modelling is a development of linear regression that accommodates both non-normal response distributions and transformations to linearity in a straightforward way. The underlying assumptions are loose enough to encompass a wide class of models but tight enough to allow the development of a unified methodology of estimation and inference.

A generalized linear model, GLM, is defined by the following assumptions:

  1. Observations are taken on a one-dimensional response variable $Y_i$, indexed by $i = 1, \ldots, n$, together with values of explanatory variables $x_{1,i}, \ldots, x_{p,i}$, where $p < n$.

  2. The responses $Y_i$, $i = 1, \ldots, n$, are realisations of random variables which are observed independently.

  3. The conditional distribution of $Y$ is a member of the exponential family, EF, with mean $\mu$ and fixed (known) scale parameter $\phi$. The conditioning is on the observed values of the explanatory variables.

  4. The explanatory variables influence the distribution of $Y$ through a single linear function called the linear predictor:

    $$\eta = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$

    Hence the covariate $x_j$ has no influence on the distribution of $Y$ if and only if $\beta_j = 0$.

  5. The mean $\mu_i$ of $Y_i$ and the linear predictor are related by a smooth invertible function $g(\cdot)$ called the link function:

    $$g(\mu) = \eta.$$

    For the simple linear regression model, $g$ is the identity function, so that $\mu = \eta$.
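To make these assumptions concrete, here is a minimal sketch that simulates data satisfying them, assuming a Poisson response with a log link; the coefficients, sample size, and family are chosen purely for illustration and are not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example settings: n observations of p = 2 explanatory variables.
n, p = 100, 2
X = rng.normal(size=(n, p))        # values x_{1,i}, ..., x_{p,i}
beta = np.array([0.5, -0.3])       # fixed regression coefficients

# Assumption 4: the covariates act only through the linear predictor.
eta = X @ beta                     # eta_i = beta_1 x_{1,i} + beta_2 x_{2,i}

# Assumption 5 with a log link: g(mu) = log(mu) = eta, so mu = exp(eta).
mu = np.exp(eta)

# Assumptions 2 and 3: independent responses from an exponential-family
# distribution, here Poisson with mean mu_i.
y = rng.poisson(mu)
```

Setting a coefficient $\beta_j$ to zero in this sketch removes the corresponding covariate's influence on $Y$ entirely, exactly as assumption 4 states.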

1.4.1 A graphical representation of a GLM

Below is a diagram that succinctly represents the many concepts associated with a GLM:

[Figure: schematic of a GLM — the explanatory variables x feed, via the coefficients β, into the linear predictor η, which maps through the link function to the mean μ of the response Y.]

The essential causal idea is that changes in the explanatory variables in x lead to changes in the response Y. The relationship depends on fixed values of the parameter β, the regression coefficients. The relationship is mediated through the linear predictor, η, which is one-dimensional and aggregates the separate explanatory variables in x. The linear predictor has a one-to-one mapping with the mean response μ through the link function so that changes in x affect Y only through changing the μ parameter. The specific relationship between μ and Y depends on the particular distribution function.

We later return to these assumptions and explore them in detail.

1.4.2 Stages in modelling

The usual stages involved in developing a statistical model are:

  1. Initial formulation – This may be based on scientific assumptions about the real-world process, but is often also based on exploratory plots of the data. Also known as model specification.

  2. Model fitting – Parameter estimation, usually by maximising the likelihood; see the sketch after this list.

  3. Variable selection – Deciding which variables to include in the explanatory part of the model.

  4. Model checking – Assessing the goodness of fit, for instance by investigating the residuals from the model.

  5. Re-formulation – If model checking indicates that some assumption is invalid, then an amendment to the model may be required.

  6. Interpretation – Interpret the estimated parameters of the fitted model within the context of the data.

Simplicity is a desirable feature of a model: we want a model that is as simple as possible, because the simpler the model, the easier it is to understand and to draw inference from. This principle is often known as parsimony. A parsimonious model typically gives better predictions than one that is unnecessarily complicated.

Remember: All models are wrong, but some models are more useful than others.