1.6.7 Transforming data

When data are very strongly skewed, we sometimes transform them so they are easier to model. Consider the histogram of Major League Baseball players’ salaries from 2010, shown in Figure LABEL:histMLBSalariesReg.

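R> data(MLB)                            # load the MLB salary data
R> hist(MLB[,4]/1000, breaks=15)        # histogram of column 4 divided by 1000 (original scale)
R> hist(log(MLB[,4]/1000), breaks=15)   # the same values after a natural-log transformation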

Example 1.6.20

The histogram of MLB player salaries is useful in that we can see the data are extremely skewed and centred (as gauged by the median) at about $1 million. What isn’t useful about this plot?

Answer. Most of the data are collected into one bin in the histogram, and the data are so strongly skewed that many details in the data are obscured.

There are some standard transformations that are often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive. A transformation is a rescaling of the data using a function. For instance, a plot of the natural logarithm of player salaries results in a new histogram in Figure LABEL:histMLBSalariesLog. (Statisticians often write the natural logarithm as $\log$; you might be more familiar with it being written as $\ln$.) Transformed data are sometimes easier to work with when applying statistical models because the transformed data are much less skewed and outliers are usually less extreme.
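The effect of a log transformation on a skewed sample can be checked numerically as well as visually. The sketch below is only an illustration under stated assumptions: it uses simulated log-normal values rather than the MLB salaries, and the helpers fullest_bin_share and skewness are hypothetical names written for this example, not functions from base R.

```r
# Simulated, strictly positive, right-skewed values (NOT the MLB salaries).
set.seed(1)
salaries_sim <- rlnorm(500, meanlog = 0, sdlog = 1)

# Hypothetical helper: share of observations falling in the fullest histogram bin.
# plot = FALSE returns the bin counts without drawing; drop it to see the plot.
fullest_bin_share <- function(v, breaks = 15) {
  h <- hist(v, breaks = breaks, plot = FALSE)
  max(h$counts) / length(v)
}

# Hypothetical helper: sample skewness as the third standardised moment.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

share_raw <- fullest_bin_share(salaries_sim)
share_log <- fullest_bin_share(log(salaries_sim))
skew_raw  <- skewness(salaries_sim)
skew_log  <- skewness(log(salaries_sim))

cat("fullest-bin share: raw =", round(share_raw, 2), " log =", round(share_log, 2), "\n")
cat("skewness:          raw =", round(skew_raw, 2), " log =", round(skew_log, 2), "\n")
```

A large fullest-bin share corresponds to the problem described in the answer above: most of the data sit in a single bin, so the histogram shows little detail.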

Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of the line_breaks and num_char variables is shown in Figure LABEL:email50LinesCharactersMod, which was earlier shown in Figure LABEL:email50LinesCharacters. We can see a positive association between the variables and that many observations are clustered near zero. In the Math235 course, we might want to use a straight line to model the data. However, we’ll find that the data in their current state cannot be modelled very well. Figure LABEL:email50LinesCharactersModLog shows a scatterplot where both the line_breaks and num_char variables have been transformed using a log (base $e$) transformation. While there is a positive association in each plot, the transformed data show a steadier trend, which is easier to model than the untransformed data.

R> plot(email50[,14], email50[,15], pch=19)             # original scale
R> plot(log(email50[,14]), log(email50[,15]), pch=19)   # both variables log-transformed (base e)
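To see why a log transformation of both variables can help straight-line modelling, the following sketch fits a line before and after the transformation. It is an assumption-laden illustration: the data are simulated (not email50), the exponent 0.8 and the noise level are arbitrary choices, and lm is used here only as one possible way to fit a straight line.

```r
# Simulated data (NOT email50): y is roughly proportional to x^0.8,
# with multiplicative noise, so many points sit near zero on the raw scale.
set.seed(2)
x <- rlnorm(200, meanlog = 1, sdlog = 1)
y <- 3 * x^0.8 * rlnorm(200, meanlog = 0, sdlog = 0.4)

# Straight-line fits on the original scale and on the log-log scale.
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ log(x))

# On the log-log scale the slope estimates the exponent (0.8 in this simulation).
slope_log <- unname(coef(fit_log)[2])
cat("log-log slope:", round(slope_log, 2), "\n")
cat("R^2 raw:", round(summary(fit_raw)$r.squared, 2),
    " R^2 log-log:", round(summary(fit_log)$r.squared, 2), "\n")

# To compare the two scatterplots visually:
# plot(x, y, pch = 19); plot(log(x), log(y), pch = 19)
```

Because $\log y = \log 3 + 0.8 \log x + \text{noise}$ in this simulation, the relationship is exactly linear on the log-log scale, which is the kind of steadier trend described above.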

Transformations other than the logarithm can be useful, too. For instance, the square root ($\sqrt{\text{original observation}}$) and inverse ($\frac{1}{\text{original observation}}$) are used by statisticians. Common goals in transforming data are to see the data structure differently, reduce skew, assist in modelling, or straighten a nonlinear relationship in a scatterplot.
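As a small worked illustration of these alternatives, the sketch below applies the square root, inverse, and log to a handful of made-up positive values (chosen only for easy arithmetic, not taken from any data set in these notes). One point worth noticing is that the inverse is a decreasing function on positive values, so it reverses the ordering of the observations.

```r
# Made-up positive values with one large observation (illustrative only).
x <- c(1, 4, 9, 16, 100)

transformed <- data.frame(
  original = x,
  sqrt     = sqrt(x),
  inverse  = 1 / x,
  log      = log(x)
)
print(round(transformed, 3))

# Ratio of the largest to the smallest value, before and after the square root.
spread_original <- max(x) / min(x)
spread_sqrt     <- max(sqrt(x)) / min(sqrt(x))

# The inverse flips the ordering: the largest original value becomes the smallest.
inverse_reverses_order <- identical(order(1 / x), rev(order(x)))
```

Here the square root shrinks the largest-to-smallest ratio from 100 to 10, which shows how such a transformation pulls in large observations relative to small ones.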