A box plot summarizes a data set using five statistics while also plotting unusual observations. Figure LABEL:boxPlotLayoutNumVar provides a vertical dot plot alongside a box plot of the num_char variable from the email50 data set.
R> boxplot(email50[,14], xlim=c(0.6,1.3))
R> points(rep(0.7, 25), rev(sort(email50[,14]))[1:25], pch=1, col='blue')  # 25 largest values: open circles
R> points(rep(0.7, 25), sort(email50[,14])[1:25], pch="-", col='red')  # 25 smallest values: dashes
R> summary(email50[,14])
The first step in building a box plot is drawing a dark line denoting the median, which splits the data in half. Figure LABEL:boxPlotLayoutNumVar shows 50% of the data falling below the median (dashes) and the other 50% falling above the median (open circles). There are 50 character counts in the data set (an even number), so the data are perfectly split into two groups of 25. We take the median in this case to be the average of the two observations closest to the 50th percentile. When there is an odd number of observations, exactly one observation splits the data into two halves, and in this case that observation is the median (no average needed).
Median: the number in the middle
If the data are ordered from smallest to largest, the median is the observation right in the
middle. If there are an even number of observations, there will be two values in the middle, and
the median is taken as their average.
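The even and odd cases described above can be sketched in a few lines of R. The vectors below are hypothetical values chosen only for illustration (they are not taken from email50), and `median_by_hand` is a name introduced here, not part of any package.

```r
# Hypothetical values for illustration only (not from email50).
x_even <- c(7, 1, 9, 4)      # four observations: two values in the middle
x_odd  <- c(7, 1, 9, 4, 12)  # five observations: one value in the middle

# Median computed directly from the definition.
median_by_hand <- function(x) {
  x <- sort(x)                     # order from smallest to largest
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd n: the single middle observation
  } else {
    mean(x[c(n / 2, n / 2 + 1)])   # even n: average of the two middle ones
  }
}

median_by_hand(x_even)  # 5.5
median_by_hand(x_odd)   # 7
```

Both results agree with R's built-in `median()`.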
The second step in building a box plot is drawing a rectangle to represent the middle 50% of the data. The total length of the box, shown vertically in Figure LABEL:boxPlotLayoutNumVar, is called the interquartile range (IQR, for short). It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR. The two boundaries of the box are called the first quartile (the 25th percentile, i.e. 25% of the data fall below this value) and the third quartile (the 75th percentile), and these are often labelled $Q_1$ and $Q_3$, respectively.
Interquartile range (IQR)
The IQR is the length of the box in a box plot. It is computed as
$$IQR = Q_3 - Q_1$$
where $Q_1$ and $Q_3$ are the 25th and 75th percentiles.
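As a minimal sketch of this formula, the quartiles and the IQR can be computed with `quantile()`. The data are hypothetical values chosen for illustration. Software uses several slightly different conventions for percentiles, and this example simply relies on R's default, so other tools may give slightly different quartiles.

```r
# Hypothetical values for illustration only (not from email50).
x <- c(2, 4, 5, 7, 8, 10, 13, 21, 40)

# First and third quartiles using R's default percentile convention.
q <- quantile(x, c(0.25, 0.75))
q1 <- unname(q[1])
q3 <- unname(q[2])

# IQR = Q3 - Q1
iqr <- q3 - q1

c(q1, q3, iqr)  # 5 13 8
```

The built-in `IQR(x)` returns the same value.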
What percent of the data fall between $Q_1$ and the median? What percent is between the median and $Q_3$?
Answer. Since $Q_1$ and $Q_3$ capture the middle 50% of the data and the median splits the data in the middle, 25% of the data fall between $Q_1$ and the median, and another 25% fall between the median and $Q_3$.
Extending out from the box, the whiskers attempt to capture the data outside of the box. However, their reach is never allowed to be more than $1.5\times IQR$. (While the choice of exactly 1.5 is arbitrary, it is the most commonly used value for box plots.) They capture everything within this reach. In Figure LABEL:boxPlotLayoutNumVar, the upper whisker does not extend to the last three points, which are beyond $Q_3 + 1.5\times IQR$, and so it extends only to the last point below this limit. The lower whisker stops at the lowest value, 33, since there is no additional data to reach; the lower whisker’s limit is not shown in the figure because the plot does not extend down to $Q_1 - 1.5\times IQR$. In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data.
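The whisker rule can be sketched as follows: compute the farthest each whisker may reach, then stop each whisker at the most extreme observation inside that limit. The data are hypothetical values for illustration, and the variable names are introduced here. Quartiles come from `quantile()` with its default convention, which is an assumption of this sketch.

```r
# Hypothetical values for illustration only (not from email50).
x <- c(2, 4, 5, 7, 8, 10, 13, 21, 40)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Maximum reach of each whisker: 1.5 * IQR beyond the box.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Each whisker stops at the most extreme observation within its limit.
lower_whisker <- min(x[x >= lower_limit])
upper_whisker <- max(x[x <= upper_limit])

c(lower_limit, upper_limit)      # -7 25
c(lower_whisker, upper_whisker)  # 2 21
```

Here the lower whisker stops at the smallest value, 2, because there are no data between it and the lower limit, while the value 40 lies beyond the upper limit and is not reached by the upper whisker.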
Any observation that lies beyond the whiskers is labelled with a dot. The purpose of labelling these points – instead of just extending the whiskers to the minimum and maximum observed values – is to help identify any observations that appear to be unusually distant from the rest of the data. Unusually distant observations are called outliers. In this case, it would be reasonable to classify the emails with character counts of 41,623, 42,793, and 64,401 as outliers since they are numerically distant from most of the data.
Outliers are extreme
An outlier is an observation that appears extreme relative to the rest of the data.
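For a quick check of which points a box plot would draw individually, base R's `boxplot.stats()` reports them directly. The sketch below uses hypothetical values for illustration. Note that `boxplot.stats()` builds the box from hinges, which can differ slightly from the quartiles returned by `quantile()`, so the flagged points may not always match a hand calculation.

```r
# Hypothetical values for illustration only (not from email50).
x <- c(2, 4, 5, 7, 8, 10, 13, 21, 40)

# coef = 1.5 is the usual whisker reach in units of the box length.
bs <- boxplot.stats(x, coef = 1.5)

bs$out    # observations beyond the whiskers: 40
bs$stats  # lower whisker, box bottom, median, box top, upper whisker: 2 5 8 13 21
```

A point flagged this way is only a candidate outlier; whether it is an error or a genuine extreme value still has to be judged from the context of the data.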
The observation 64,401, a suspected outlier, was found to be an accurate observation. What would such an observation suggest about the nature of character counts in emails?
Answer. That occasionally there may be very long emails.
TIP: Why it is important to look for outliers
Examination of data for possible outliers serves many useful purposes, including
1. Identifying strong skew in the distribution.
2. Identifying data collection or entry errors. For instance, we re-examined the email purported to have 64,401 characters to ensure this value was accurate.
3. Providing insight into interesting properties of the data.
Using Figure LABEL:boxPlotLayoutNumVar, estimate the following values for num_char in the email50 data set: (a) $Q_1$, (b) $Q_3$, and (c) IQR.
Answer. These visual estimates will vary a little from one person to the next: 3,000, 15,000, 12,000. (The true values: 2,536, 15,411, 12,875.)
R> out=summary(email50[,14])
R> diff(quantile(email50[,14], c(0.25,0.75)))