1 Introduction

1.4 What is a GLM?

We have discussed several statistical models that share a common structure for describing the relationship between a single random variable, the response, and one or more explanatory variables, the covariates, which may or may not themselves be random. These models are collectively known as generalized linear models, or GLMs.

Generalized linear modelling is a development of linear regression that accommodates both non-normal response distributions and transformations to linearity in a straightforward way. The underlying assumptions are loose enough to encompass a wide class of models but tight enough to allow the development of a unified methodology of estimation and inference.

A generalized linear model, GLM, is defined by the following assumptions:

  1. Observations are taken on a one-dimensional response variable $Y_i$, indexed by $i = 1, \ldots, n$, together with values of explanatory variables $x_{1,i}, \ldots, x_{p,i}$, where $p < n$.

  2. The responses $Y_i$, $i = 1, \ldots, n$, are realisations of random variables which are observed independently.

  3. The conditional distribution of $Y$ is a member of the exponential family, EF, with mean $\mu$ and fixed (known) scale parameter $\phi$. The conditioning is on the observed values of the explanatory variables.

  4. The explanatory variables influence the distribution of $Y$ through a single linear function called the linear predictor:

    $$\eta = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$

    Hence the covariate $x_j$ has no influence on the distribution of $Y$ if and only if $\beta_j = 0$.

  5. The mean $\mu_i$ of $Y_i$ and the linear predictor are related by a smooth invertible function $g(\cdot)$ called the link function:

    $$g(\mu) = \eta.$$

    For the simple linear regression model, $g$ is the identity function, so that $\mu = \eta$.
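To make these assumptions concrete, here is a minimal sketch that simulates data satisfying them, assuming a Poisson response with a log link; the coefficients, sample size, and family are chosen purely for illustration and are not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example settings: n observations of p = 2 explanatory variables.
n, p = 100, 2
X = rng.normal(size=(n, p))        # values x_{1,i}, ..., x_{p,i}
beta = np.array([0.5, -0.3])       # fixed regression coefficients

# Assumption 4: the covariates act only through the linear predictor.
eta = X @ beta                     # eta_i = beta_1 x_{1,i} + beta_2 x_{2,i}

# Assumption 5 with a log link: g(mu) = log(mu) = eta, so mu = exp(eta).
mu = np.exp(eta)

# Assumptions 2 and 3: independent responses from an exponential-family
# distribution, here Poisson with mean mu_i.
y = rng.poisson(mu)
```

Setting a coefficient $\beta_j$ to zero in this sketch removes the corresponding covariate's influence on $Y$ entirely, exactly as assumption 4 states.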

1.4.1 A graphical representation of a GLM

Below is a diagram that succinctly represents the many concepts associated with a GLM:

[Figure: schematic of a GLM — the explanatory variables x feed, via the coefficients β, into the linear predictor η, which maps through the link function to the mean μ of the response Y.]

The essential causal idea is that changes in the explanatory variables in x lead to changes in the response Y. The relationship depends on fixed values of the parameter β, the regression coefficients. The relationship is mediated through the linear predictor, η, which is one-dimensional and aggregates the separate explanatory variables in x. The linear predictor has a one-to-one mapping with the mean response μ through the link function so that changes in x affect Y only through changing the μ parameter. The specific relationship between μ and Y depends on the particular distribution function.

We later return to these assumptions and explore them in detail.

1.4.2 Stages in modelling

The usual stages involved in developing a statistical model are:

  1. Initial formulation – This may be based on scientific assumptions about the real-world process, but is often also based on exploratory plots of the data. Also known as model specification.

  2. Model fitting – Parameter estimation, usually by maximising the likelihood; see the sketch after this list.

  3. Variable selection – Deciding which variables to include in the explanatory part of the model.

  4. Model checking – Assessing the goodness of fit, for instance by investigating the residuals from the model.

  5. Re-formulation – If model checking indicates that some assumption is invalid, then an amendment to the model may be required.

  6. Interpretation – Interpret the estimated parameters of the fitted model within the context of the data.

Simplicity is a desirable feature of a model: we want a model that is as simple as possible, because the simpler the model, the easier it is to understand and to draw inference from. This principle is often known as parsimony. A parsimonious model typically gives better predictions than one that is unnecessarily complicated.

Remember: All models are wrong, but some models are more useful than others.