2 Exploratory Data Analysis

2.1 Numerical and graphical summaries

The easiest approach to investigating the structure of, and relationships between, variables is to evaluate simple numerical summaries and graphical displays. The aim is to quantify the four key features of the distribution of the data:

  1. What values best represent the local concentration of the data?

  2. How dispersed is the data?

  3. What is the shape of the data?

  4. How does one variable relate to another?

Simple summary statistics such as the sample mean, sample variance, median, range and quartiles are useful for investigating the location and dispersion of continuous variables. These summary statistics are not applicable to categorical variables, for which tabulation is the best tool to identify the most frequent event, i.e. the mode.

Some of the key summary measures are evaluated in R by the command summary(). For example, the summary for the birthweight data set is:

birthweight <- read.table("birthweight.dat", header = TRUE)
summary(birthweight)

     weight          age        sex
 Min.   :2412   Min.   :35.00   F:12
 1st Qu.:2785   1st Qu.:37.00   M:12
 Median :2952   Median :38.50
 Mean   :2968   Mean   :38.54
 3rd Qu.:3184   3rd Qu.:40.00
 Max.   :3473   Max.   :42.00
Figure 2.1: Left: histogram of birthweight in grams. Centre: histogram of gestational age. Right: bar chart of baby's gender.

Graphical representations of data using histograms and bar charts help in exploring the shape of the data and in identifying unusual features that a proposed model may need to explain. For example, Figure 2.1 presents histograms of weight and age for the birthweight data, along with a bar chart for gender. Here we see that the distribution for weight appears to be bi-modal: one mode at around the mean of 2968 g and a second, minor mode at about 3300 g. Because the data set is small, it is difficult to tell whether this is a genuine feature of the data, which the statistical model should incorporate, or due to natural sampling variability.

Unusual features identified at this stage may require further investigation to determine whether they are errors in the data or potential outliers that may have undesirable effects on the model and parameter estimates.

Scatter plots, such as those illustrated in Chapter 1, provide a lot of information about the relationship between two numerical variables. Using additional plotting features, such as point type and colour, is useful for seeing how the dependence varies with respect to a third variable. However, adding too many features to a plot can impede interpretability.
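As a sketch, point type and colour can be set from a third variable in base R. The data below are simulated purely for illustration; the birthweight variables weight, age and sex could be used in the same way.

```r
# Simulated illustration: colour and point type encode a third,
# categorical variable on a scatter plot of two numerical variables.
set.seed(3)
grp <- factor(sample(c("A", "B"), 50, replace = TRUE))
x   <- rnorm(50)
y   <- 2 * x + ifelse(grp == "A", 0, 1.5) + rnorm(50)
plot(x, y,
     pch = ifelse(grp == "A", 19, 17),          # point type by group
     col = ifelse(grp == "A", "blue", "red"))   # colour by group
legend("topleft", legend = levels(grp), pch = c(19, 17),
       col = c("blue", "red"))
```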

The extent to which two continuous variables are related can be measured by their correlation. However, this is a measure of linear association only, and so it is possible for two variables to be uncorrelated but still possess some unusual form of relationship. For example, the correlation in Figure 2.2 is approximately zero, but it is clear that these variables are not independent.
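A minimal sketch of this phenomenon with simulated data: y is a deterministic function of x, yet the sample correlation is close to zero.

```r
# y = x^2 depends exactly on x, but since x is symmetric about zero
# there is no linear association, so the correlation is approximately 0.
set.seed(2)
x <- runif(1000, min = -1, max = 1)
y <- x^2
cor(x, y)   # close to 0 despite exact dependence
```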

Figure 2.2: Left: scatter plot of two associated but uncorrelated random variables. Centre: sequence of box-plots depicting the association between a numerical response variable and a categorical explanatory variable. Right: box-plots illustrating potential collinearity between explanatory variables.

Scatter plots are limited in describing the relationship between a numerical and a categorical variable, as many points can be drawn on top of one another. A sequence of box-plots, one for each categorical event, provides a clearer description of the relationship. For example, the sequence of box-plots in Figure 2.2 illustrates how the dependent variable for the second group differs from the first and third groups, which themselves appear to be similarly distributed.
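A sequence of box-plots of this kind can be drawn with a single formula call; the data here are simulated, with group 2 shifted upwards to mimic the pattern described above.

```r
# Simulated illustration: box-plots of a numerical variable for each
# level of a categorical variable; group 2 is shifted upwards.
set.seed(4)
group <- factor(rep(1:3, each = 40))
y <- rnorm(120) + ifelse(group == "2", 2, 0)
boxplot(y ~ group, xlab = "group", ylab = "y")
```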

Examining the relationships between the dependent and explanatory variables is useful for identifying which explanatory variables are likely to be important for a statistical model to describe the variability in the responses. It is equally important to investigate how the explanatory variables relate to one another, in order to be aware of potential collinearity problems. This is certainly the case in Figure 2.2, where there is a clear relationship between the ‘group’ and ‘x’ explanatory variables.

2.1.1 Dependence between categorical variables

The methods for exploring data discussed so far are most applicable when at least one of the variables you are examining is numerical. However, graphical illustrations for examining potential dependencies between two categorical variables are not useful as, for example, a scatter plot will only show a point at each distinct pair of events with multiple occurrences being drawn on top of one another.

The relationship between two categorical variables is best examined using a contingency table, or cross tabulation, which displays the frequency of all possible pairs of events from the two categorical variables of interest.

         X=1   X=2   ⋯   X=I   Total
  Y=1    n11   n21   ⋯   nI1   r1
  Y=2    n12   n22   ⋯   nI2   r2
  ⋮       ⋮     ⋮         ⋮     ⋮
  Y=J    n1J   n2J   ⋯   nIJ   rJ
  Total  c1    c2    ⋯   cI    N

Here, nij denotes the number of times the pair (X,Y) = (i,j) appears in a data set containing a total of N records. The column and row totals, denoted by ci and rj respectively, represent the marginal totals for each of the categorical variables.

For example, the data set APP contains information from 480 customers who took part in a consumer satisfaction survey regarding three smartphone applications. The contingency table below presents the consumers’ responses to the question “Would you recommend the app to your friends?” against which of the three apps they had downloaded.

App1 App2 App3 Total
Yes 75 94 121 290
No 21 50 119 190
Total 96 144 240 480
Table 2.1: Contingency table of the app consumer satisfaction survey.

Here, the response ‘Yes’ is more frequent across all three apps, and the frequency increases from App1 to App2 and again for App3 for both Yes/No responses. Since the marginal relationship between events for each categorical variable is reflected within the contingency table, is this sufficient to state that there is no relationship between these categorical variables?

It is not sufficient to compare values based on magnitude but rather comparisons should be performed on proportions. For example, about half of the users of App3 responded ‘Yes’ to the question whereas about three-quarters of App1 users gave the same response. Further investigation is needed to distinguish whether this difference in proportions represents some form of dependency or natural sampling variation.
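These proportions can be computed directly in R from the counts in Table 2.1:

```r
# Contingency table of Table 2.1 (counts taken from the text).
sat <- matrix(c(75, 94, 121,
                21, 50, 119),
              nrow = 2, byrow = TRUE,
              dimnames = list(Response = c("Yes", "No"),
                              App = c("App1", "App2", "App3")))
round(prop.table(sat, margin = 2), 2)   # proportions within each app
# Proportion answering 'Yes': App1 0.78, App2 0.65, App3 0.50
```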

Recall that if two random variables are independent, their joint probability mass/density function can be expressed as the product of the marginal probability mass/density functions. For the general contingency table, the probability that a record possesses the event pair (i,j), denoted by pij, can be written under the assumption of independence as:

pij = pi pj,   for i ∈ {1, …, I} and j ∈ {1, …, J}

where {pi} and {pj} are the marginal probabilities for each categorical variable. The true probabilities are unknown, but the proportions of the column/row totals relative to the overall total provide unbiased estimates:

p̂i = ci / N   and   p̂j = rj / N

Given these marginal probabilities, we can then estimate the expected number of samples to appear in each cell of the contingency table under the assumption of independence by:

Ei,j = N pij = N pi pj ≈ N p̂i p̂j = N (ci/N)(rj/N) = ci rj / N.
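As a sketch of this calculation in R, using a small hypothetical 2×2 table whose counts are invented purely for illustration:

```r
# Hypothetical 2x2 table of counts; E_ij = c_i * r_j / N via outer().
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Y = c("1", "2"), X = c("1", "2")))
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)
E   # row totals (40, 60) times column totals (50, 50) over N = 100
```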

 
Exercise 2.8
For the app example, calculate the expected number of consumers there should be for each application and Yes/No pair under the assumption of independence.

 

2.1.2 Chi-squared test for dependence

Now that we know the expected number of events for each event pair (X,Y) under the assumption of independence, we need to assess whether the differences between the observed and expected tables are truly an indication of some relationship, or merely reflect sampling variation. For this, we use Pearson’s chi-squared test for independence.

Definition 2.1.1.

For the chi-squared test for independence, the test statistic:

X² = ∑_{i=1}^{I} ∑_{j=1}^{J} (ni,j - Ei,j)² / Ei,j

has an approximate χ²_{(I-1)(J-1)} distribution.

Informal proof: Consider the simplest table, with I = 1 and J = 2, where the table consists of only two cells. Let n be the number of events in the first cell and N denote the overall total number of records, so that there are N - n events in the second cell. It is clear that the count in the first cell is distributed n ~ Binom(N, p) for some unknown probability p. The expected number for the first cell is E1 = Np whilst that for the second cell is E2 = N(1 - p). Substituting these values into the chi-squared test statistic gives:

X² = (n - Np)² / (Np) + (N - n - N(1 - p))² / (N(1 - p))
   = (n - Np)² / (Np) + (-n + Np)² / (N(1 - p))
   = [(1 - p)(n - Np)² + p(n - Np)²] / (Np(1 - p))
   = (n - Np)² / (Np(1 - p))
   = ((n - 𝔼[n]) / √var(n))²

For large N, the distribution of n is approximately normal, N(Np, Np(1 - p)). Thus X² is approximately χ²₁ distributed. When generalising to an I × J table, the distribution of the test statistic corresponds to a sum of chi-squares, but we must take into account the fact that the last row and column of the table can be derived from the marginal totals and the values in the other cells of that row or column. Therefore, the degrees of freedom for a general table is (I - 1)(J - 1).
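The statistic can be sketched in R as follows; the table here is built from simulated independent variables (not the app data), and the hand computation is checked against the built-in chisq.test():

```r
# Simulate two independent categorical variables and tabulate them.
set.seed(1)
x <- sample(c("A", "B", "C"), 500, replace = TRUE)   # I = 3 events
y <- sample(c("Yes", "No"),   500, replace = TRUE)   # J = 2 events
tab <- table(y, x)

# Pearson's statistic from observed and expected counts.
E  <- outer(rowSums(tab), colSums(tab)) / sum(tab)
X2 <- sum((tab - E)^2 / E)
df <- (3 - 1) * (2 - 1)                    # (I - 1)(J - 1) = 2
pval <- pchisq(X2, df = df, lower.tail = FALSE)

# Agrees with the built-in test (no continuity correction is applied
# for tables larger than 2x2).
chisq.test(tab)$statistic   # equals X2
```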

 
Exercise 2.9
Perform a chi-squared test for independence for the app study. What do you conclude?

 

When using Pearson’s chi-squared test, it is important to be aware of the following assumptions.

  • The data are assumed independent of each other and are drawn from a population where every member has an equal probability of selection.

  • The overall sample size N must be sufficiently large such that the count in any one cell is not too small.

  • The expected count in any cell must not be zero or too small; a common rule of thumb is that all expected counts should be at least 5.