A parametric statistical model for a set of data consists of a probability distribution with unknown parameters. It allows us to use sample data to estimate characteristics of the population from which the data was sampled.
Let $x_1, \ldots, x_n$ be a data sample from a population. A simple statistical model assumes that these data can be modelled as realisations of independent and identically distributed (IID) random variables $X_1, \ldots, X_n$. In other words:
Independence
Knowing that $X_i = x_i$ does not alter the probabilities of the possible outcomes for each of the remaining variables.
Identically distributed
Each of $X_1, \ldots, X_n$ follows a common distribution, with c.d.f. $F$.
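Taken together, the two assumptions imply that the joint distribution of the sample factorises into a product of identical terms. As a short worked statement, assuming the common c.d.f. is written $F$ (a notational assumption made here):

\[
P(X_1 \le x_1, \ldots, X_n \le x_n) = \prod_{i=1}^{n} P(X_i \le x_i) = \prod_{i=1}^{n} F(x_i).
\]

The first equality uses independence; the second uses the fact that every $X_i$ has the same distribution.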
What does this mean in practice?
First we must select an appropriate probability distribution for $X_1, \ldots, X_n$; this will have one or more unknown parameters $\theta$. Next we must estimate the unknown parameter(s). We denote an estimate using ‘hat’ notation, i.e. $\hat{\theta}$.
Having estimated the parameters, we should assess model fit: does the observed sample look like an independent random sample from the distribution with parameter $\hat{\theta}$? Only then can we use the model to describe the behaviour of the population, test hypotheses or make predictions.
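The three steps above (select a distribution, estimate its parameters, assess fit) can be sketched in R. This is only a minimal illustration under stated assumptions: the data are simulated rather than observed, the Normal family is chosen purely as an example, the names `x`, `mu_hat` and `sigma2_hat` are hypothetical, and the sample mean and sample variance are one common choice of estimator rather than necessarily the method used later in these notes.

```r
set.seed(42)                          # make the simulated sample reproducible
x <- rnorm(50, mean = 10, sd = 2)     # simulated stand-in for an observed sample

# Step 1: select a distribution (here the Normal family, as an illustration).

# Step 2: estimate the unknown parameters.
mu_hat     <- mean(x)                 # sample mean as an estimate of the mean
sigma2_hat <- var(x)                  # sample variance (divisor n - 1)

# Step 3: assess fit with a Normal quantile-quantile plot.
qqnorm(x)                             # points near a straight line suggest a reasonable fit
qqline(x)

c(mu_hat = mu_hat, sigma2_hat = sigma2_hat)
```

If the points in the quantile-quantile plot fall close to the reference line, the Normal assumption is at least not obviously contradicted by the sample; this check says nothing about independence, which has to be examined separately.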
An important measure of potential climate change is the fluctuation in the Arctic sea ice extent. The sea ice extent varies seasonally through the year, with the annual maximum occurring in March and the annual minimum occurring in September. Scientists are especially concerned that the annual minimum is gradually declining, which could eventually lead to an ice-free summer. Figure 2.2 shows both a time-series plot and a histogram of the minimum sea ice extent (in millions of km$^2$) from 1979 to 2010 (data obtained from the National Snow and Ice Data Center, http://nsidc.org/).
Before going any further, we should always look at the data through exploratory analysis.
First load the arctic.Rdata file into R; this contains a data frame, minSeaIce, which has two columns.
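One way to carry out the loading step is sketched below. It assumes that arctic.Rdata sits in the current working directory. The helper `load_sea_ice` is hypothetical and introduced only for illustration; in an interactive session a plain `load("arctic.Rdata")` does the same job. The column names are not shown in the text, so the sketch inspects them rather than assuming them.

```r
# Hypothetical helper (not part of the notes): load the workspace file and
# return the minSeaIce data frame, or NULL if the file cannot be found.
load_sea_ice <- function(path = "arctic.Rdata") {
  if (!file.exists(path)) return(NULL)
  env <- new.env()
  load(path, envir = env)            # restore the saved objects into env
  get("minSeaIce", envir = env)      # object name as given in the text
}

minSeaIce <- load_sea_ice()

if (!is.null(minSeaIce)) {
  str(minSeaIce)                     # column names and types
  head(minSeaIce)                    # first few rows
}
```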
We can examine the data using a time series plot and a histogram:
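A possible version of these two plots is sketched below. Two assumptions are made and should be checked against the real file: that the first column of minSeaIce holds the year and the second holds the minimum extent. So that the sketch runs on its own, it falls back to a simulated stand-in when minSeaIce is not available; those values are arbitrary and are not the NSIDC data.

```r
# Fall back to a simulated stand-in if the real data frame is not loaded.
# These values are arbitrary placeholders, NOT the NSIDC measurements.
if (!exists("minSeaIce") || !is.data.frame(minSeaIce)) {
  set.seed(1)
  minSeaIce <- data.frame(year   = 1979:2010,
                          extent = rnorm(32, mean = 6, sd = 1))
}

# Assumption: column 1 is the year and column 2 is the minimum extent.
year   <- minSeaIce[[1]]
extent <- minSeaIce[[2]]

op <- par(mfrow = c(1, 2))                    # two panels side by side
plot(year, extent, type = "b",
     xlab = "Year", ylab = "Minimum extent", main = "Time series plot")
h <- hist(extent, xlab = "Minimum extent", main = "Histogram")
par(op)                                       # restore previous plot settings
```

The time series plot is the place to look for trend or dependence between successive years, while the histogram gives a first impression of the shape of the distribution.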
The sea ice data are shown in Figure 2.2.
What is a sensible model for these data?
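Denote the observations as $x_1, \ldots, x_n$, where $n$ is the number of observations.
Assume that $x_1, \ldots, x_n$ are observations of an IID sequence of random variables $X_1, \ldots, X_n$, with a $\text{N}(\mu, \sigma^2)$ distribution.
Informally, the data are a random sample from a $\text{N}(\mu, \sigma^2)$ distribution. In this example the unknown parameter is a vector, $\theta = (\mu, \sigma^2)$.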
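For reference, the Normal model can be written out in full. Assuming the distribution is parameterised by its mean $\mu$ and variance $\sigma^2$ (the parameterisation is an assumption made here), the model and its density are:

\[
X_1, \ldots, X_n \overset{\text{IID}}{\sim} \text{N}(\mu, \sigma^2), \qquad
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}, \quad x \in \mathbb{R}.
\]

Note that this density gives positive probability to negative values, which a sea ice extent cannot take; this is one reason the modelling assumptions need to be examined rather than taken for granted.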
Questions that we might ask:
What do the data tell us about the values of $\mu$ and $\sigma^2$?
If we had more data, how would our estimates of $\mu$ and $\sigma^2$ change?
Are the modelling assumptions sensible? i.e. Are the data independent? Is the Normal distribution appropriate?
How likely are we to see an ice-free summer, i.e. a sea ice extent of zero, at some point in the future?
Is there evidence of an increasing (or decreasing) time trend in the sea ice extent?
The first two questions relate to inference, the third and fifth to modelling and the fourth to prediction.
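As a rough illustration of how some of these questions might be approached numerically, the sketch below computes simple summaries under the IID Normal model together with a least-squares time trend. Everything in it is an assumption made for illustration: the helper `summarise_normal_model` and its argument names are hypothetical, the estimators are the sample mean and standard deviation, and the usage example runs on simulated data with arbitrary values, not on the NSIDC observations.

```r
# Hypothetical helper: year and extent are numeric vectors of equal length.
summarise_normal_model <- function(year, extent) {
  n         <- length(extent)
  mu_hat    <- mean(extent)                # estimate of the mean
  sigma_hat <- sd(extent)                  # estimate of the standard deviation
  trend     <- lm(extent ~ year)           # simple linear time trend
  list(
    mu_hat     = mu_hat,
    sigma_hat  = sigma_hat,
    se_mu      = sigma_hat / sqrt(n),      # standard error of the mean; shrinks as n grows
    p_ice_free = pnorm(0, mean = mu_hat, sd = sigma_hat),  # P(X <= 0) under the fitted model
    slope      = unname(coef(trend)["year"])               # estimated change per year
  )
}

# Usage on simulated data (arbitrary values, NOT the NSIDC observations).
set.seed(1)
year   <- 1979:2010
extent <- 7 - 0.05 * (year - 1979) + rnorm(length(year), sd = 0.4)
res    <- summarise_normal_model(year, extent)
str(res)
```

The quantity `p_ice_free` is the probability of a value at or below zero in a single year under the IID Normal model, so it ignores any time trend. If the estimated slope is clearly non-zero, the identically distributed assumption is itself in doubt, and the probability should be interpreted with caution.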