2 Statistical estimation and uncertainty

2.1 Statistical models and parameter estimation

A parametric statistical model for a set of data consists of a probability distribution with unknown parameters. It allows us to use sample data to estimate characteristics of the population from which the data were sampled.

Let $x_1,\dots,x_n$ be a data sample from a population. A simple statistical model for this data would be to assume that the data can be modelled as realisations of independent and identically distributed (IID) random variables $X_1,\dots,X_n$. In other words:

  • Independence

    Knowing that $X_i = x_i$ does not alter the probabilities of the possible outcomes for each of the remaining variables.

  • Identically distributed

    Each of $X_1,\dots,X_n$ follows a common distribution, with c.d.f. $F$.
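
Taken together, the two assumptions can be stated compactly through the joint distribution function. In the display below, the first equality is independence and the second is identical distribution:

```latex
\[
  P(X_1 \le x_1, \dots, X_n \le x_n)
  \;=\; \prod_{i=1}^{n} P(X_i \le x_i)
  \;=\; \prod_{i=1}^{n} F(x_i).
\]
```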

What does this mean in practice?

First we must select an appropriate probability distribution for $F$; this will have one or more unknown parameters $\theta$. Next we must estimate the unknown parameter(s). We denote an estimate using ‘hat’ notation, i.e. $\hat{\theta}$.

Having estimated the parameters, we should assess model fit: does the observed sample look like an independent random sample from the distribution $F$ with parameter $\hat{\theta}$? Only then can we use the model to describe the behaviour of the population, test hypotheses or make predictions.
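
The three steps (choose a distribution, estimate its parameters, assess the fit) can be sketched in R. This is a minimal illustration rather than a prescribed method: the data are simulated, the sample size and "true" parameter values are hypothetical, the Normal distribution is assumed for $F$, and the sample mean and sample standard deviation are used as one natural choice of estimates.

```r
# Illustrative only: simulated data with hypothetical parameter values.
set.seed(1)                       # make the simulation reproducible
n <- 32                           # hypothetical sample size
x <- rnorm(n, mean = 5, sd = 1)   # hypothetical "true" mean and sd

# Step 1: choose F. Here we assume a Normal distribution.

# Step 2: estimate the unknown parameters from the sample.
mu_hat    <- mean(x)              # estimate of the mean
sigma_hat <- sd(x)                # estimate of the standard deviation

# Step 3: assess fit with a Normal quantile-quantile plot.
# Points lying close to the reference line suggest the Normal
# assumption is reasonable for this sample.
qqnorm(x)
qqline(x)
```

A quantile-quantile plot is only one possible check; it addresses the shape of the distribution, not whether the observations are independent.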

Example 2.1.1 Arctic sea ice

An important measure of potential climate change is the fluctuation in the Arctic sea ice extent. The sea ice extent varies seasonally through the year, with the annual maximum occurring in March and the annual minimum occurring in September. Scientists are especially concerned that the annual minimum is gradually declining, which could eventually lead to an ice-free summer. Figure 2.2 shows both a time-series plot and a histogram of the minimum sea ice extent (in millions of km²) from 1979 to 2010. The data were obtained from the National Snow and Ice Data Center, http://nsidc.org/.

Before going any further, we should always look at the data through exploratory analysis.

First load the arctic.Rdata file into R; this contains a data frame, minSeaIce, which has two columns:

> load("arctic.Rdata")
> names(minSeaIce)
[1] "Year"      "IceExtent"

We can examine the data using a time series plot and a histogram:

> plot(minSeaIce[,1],minSeaIce[,2],xlab="Year",ylab="Sea Ice Extent",type="b")
> hist(minSeaIce[,2],xlab="Sea Ice Extent",ylab="Frequency",main="")

Fig. 2.2: Time series (left) and histogram (right) of the annual minimum sea ice extent in the Arctic from 1979–2010.

The sea ice data is shown in Figure 2.2.

What is a sensible model for these data?

  • Denote the observations as $x_1,\dots,x_n$, where $n=32$ is the number of observations.

  • Assume that $x_1,\dots,x_n$ are observations of an IID sequence of random variables $X_1,\dots,X_n$, with $\mathrm{Normal}(\mu,\sigma^2)$ distribution.

Informally, the data are a random sample from a $\mathrm{Normal}(\mu,\sigma^2)$ distribution. In this example the unknown parameter is a vector, $\theta=(\mu,\sigma)$.
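
For reference, the model can be written out in full. Each observation is assumed to come from the same Normal distribution, whose probability density function depends on the two unknown parameters:

```latex
\[
  X_i \sim \mathrm{Normal}(\mu, \sigma^2), \qquad
  f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\},
  \qquad i = 1, \dots, n.
\]
```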

Questions that we might ask:

  • What do the data tell us about the values of $\mu$ and $\sigma^2$?

  • If we had more data, how would our estimates of $\mu$ and $\sigma^2$ change?

  • Are the modelling assumptions sensible? For example, are the data independent? Is the Normal distribution appropriate?

  • How likely are we to see an ice-free summer, i.e. a sea ice extent of zero, at some point in the future?

  • Is there evidence of an increasing (or decreasing) time trend in the sea ice extent?

The first two questions relate to inference, the third and fifth to modelling and the fourth to prediction.
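
As a rough sketch of how the fourth and fifth questions might be approached in R, the code below estimates the probability of an extent at or below zero under the IID Normal model, and fits a straight line to look for a time trend. It uses a simulated series, not the NSIDC measurements: the variable names year and extent, and the level, slope and noise values, are all hypothetical. With the real data one would use minSeaIce$Year and minSeaIce$IceExtent instead.

```r
# Illustrative only: simulated series, not the NSIDC measurements.
set.seed(42)
year <- 1979:2010
# Hypothetical values: starting level 7, decline of 0.05 per year,
# Normal noise with standard deviation 0.3.
extent <- 7 - 0.05 * (year - 1979) + rnorm(length(year), sd = 0.3)

# Prediction under the IID Normal model: estimated P(X <= 0).
mu_hat    <- mean(extent)
sigma_hat <- sd(extent)
p_zero    <- pnorm(0, mean = mu_hat, sd = sigma_hat)

# Time trend: slope of a least-squares straight-line fit of
# extent against year (negative slope = decreasing trend).
trend_fit <- lm(extent ~ year)
slope     <- unname(coef(trend_fit)["year"])
```

The two calculations rest on different assumptions. The probability p_zero treats the observations as IID, which ignores any trend, so if the fitted slope is clearly non-zero the IID model, and any prediction based on it, should be treated with caution.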