1.6.6 Robust statistics

How are the sample statistics of the num_char data set affected by the observation at 64,401? What would have happened if this email had not been observed? What would happen to these summary statistics if the observation at 64,401 had been even larger, say 150,000? These scenarios are plotted alongside the original data in the accompanying dot plot, and sample statistics are computed under each scenario in Table 1.8.

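# Dot plots for the three scenarios (column 14 of email50 is num_char, in thousands of characters)
R> library(openintro)  # assumed source of email50 and dotPlot()
R> dotPlot(email50[,14], at=3, pch=20, ylim=c(0.5,3.5), xlim=c(-3.5e1,151))
# Scenario 2: drop the largest observation
R> dotPlot(email50[-which.max(email50[,14]),14], at=2, pch=20, add=TRUE)
# Scenario 3: move the largest observation to 150 (that is, 150,000 characters)
R> modemail = email50[,14]; modemail[which.max(email50[,14])] = 150
R> dotPlot(modemail, at=1, pch=20, add=TRUE)
# Code to create summaries for table: median, IQR, mean, and standard deviation per scenario
R> d = email50[,14]; median(d); diff(quantile(d, c(0.25,0.75))); mean(d); sd(d)
R> d = d[-which.max(d)]; median(d); diff(quantile(d, c(0.25,0.75))); mean(d); sd(d)
R> median(modemail); diff(quantile(modemail, c(0.25,0.75))); mean(modemail); sd(modemail)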

                              robust             not robust
scenario                    median      IQR       x̄        s
original num_char data       6,890   12,875   11,598   13,125
drop 64,401 observation      6,768   11,702   10,521   10,798
move 64,401 to 150,000       6,890   12,875   13,310   22,434

Table 1.8: A comparison of how the median, IQR, mean (x̄), and standard deviation (s) change when extreme observations are present.

Example 1.6.17

(a) Which is more affected by extreme observations, the mean or median? Table 1.8 may be helpful. (b) Is the standard deviation or IQR more affected by extreme observations?

Answer. (a) The mean is affected more. (b) The standard deviation is affected more. Complete explanations are provided in the material following Example 1.6.17.

The median and IQR are called robust estimates because extreme observations have little effect on their values. The mean and standard deviation are much more affected by changes in extreme observations.
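
The sensitivity of the mean can be checked by hand from the reported summaries. The sketch below is only a rough arithmetic check, not part of the original analysis: it uses the rounded mean from Table 1.8 and assumes n = 50 observations, as the data set name email50 suggests. Moving a single observation shifts the mean by the size of the move divided by n, so the result should land on or very near the tabulated means (small differences come from rounding in the table).

```r
# Values taken from Table 1.8 and the text.
# n = 50 is an assumption based on the data set name email50.
n <- 50
mean_original <- 11598   # mean of the original data, rounded as in the table
old_value <- 64401       # the extreme observation discussed in the text
new_value <- 150000      # where the extreme observation is moved to

# Moving one observation shifts the mean by (new - old) / n.
mean_moved <- mean_original + (new_value - old_value) / n

# Dropping one observation removes it from the total and leaves n - 1 values.
mean_dropped <- (n * mean_original - old_value) / (n - 1)

print(c(moved = mean_moved, dropped = mean_dropped))
```

A single email out of 50 is enough to move the mean by well over a thousand characters, which is why the mean is not considered robust.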

Example 1.6.18

The median and IQR do not change much under the three scenarios in Table 1.8. Why might this be the case?

Answer. The median and IQR are only sensitive to numbers near Q1, the median, and Q3. Since values in these regions are relatively stable – there aren’t large jumps between observations – the median and IQR estimates are also quite stable.
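
The same behaviour can be reproduced on a small made-up sample. The vector `x` below is hypothetical (it is not the email50 data) and the helper `four_stats` is introduced only for this illustration. Because the median and quartiles depend only on values near the middle of the sorted data, moving the largest value further out should leave them untouched while the mean and standard deviation grow.

```r
# Hypothetical data (not the email50 values): nine typical values and one extreme one.
x <- c(2, 4, 5, 7, 8, 10, 12, 15, 18, 60)

# Same data with the extreme observation moved much further out.
x_moved <- replace(x, which.max(x), 150)

# Illustrative helper: the four summaries compared in Table 1.8.
four_stats <- function(v) {
  c(median = median(v), IQR = IQR(v), mean = mean(v), sd = sd(v))
}

# One row per scenario, one column per statistic.
print(rbind(original = four_stats(x), moved = four_stats(x_moved)))
```

In this sketch only the last sorted value changes, and none of Q1, the median, or Q3 is computed from it, so the first two columns of the output are identical across rows.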

Example 1.6.19

The distribution of vehicle prices tends to be right skewed, with a few luxury and sports cars lingering out into the right tail. If you were searching for a new car and cared about price, should you be more interested in the mean or median price of vehicles sold, assuming you are in the market for a regular car?

Answer. Buyers of a “regular car” should be concerned about the median price. High-end car sales can drastically inflate the mean price while the median will be more robust to the influence of those sales.
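
To make this concrete, here is a minimal sketch with invented sale prices (the numbers are purely illustrative and do not come from any real market data). Eight regular cars and two luxury sales are enough to pull the mean far above what a typical buyer would pay, while the median stays among the regular cars.

```r
# Hypothetical sale prices in dollars: mostly regular cars plus two luxury sales.
prices <- c(18000, 21000, 22500, 24000, 25000,
            26500, 28000, 31000, 120000, 250000)

mean_price <- mean(prices)
median_price <- median(prices)

print(c(mean = mean_price, median = median_price))

# Count how many of the ten sales were priced below the mean.
print(sum(prices < mean_price))
```

With these made-up prices, every regular car sells for less than the mean, so the mean describes none of the cars a typical buyer is considering.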