1.6.3 Histograms and shape

Dot plots show the exact value for each observation. This is useful for small data sets, but they can become hard to read with larger samples. Rather than showing the value of each observation, we prefer to think of the value as belonging to a bin. For example, in the email50 data set, we create a table of counts for the number of cases with character counts between 0 and 5,000, then the number of cases between 5,000 and 10,000, and so on. Observations that fall on the boundary of a bin (e.g. 5,000) are allocated to the lower bin. This tabulation is shown in Table 1.7. These binned counts are plotted as bars in Figure LABEL:email50NumCharHist into what is called a histogram, which resembles the stacked dot plot shown in Figure LABEL:emailCharactersDotPlotStacked.

R> hist(email50[,14], breaks=12, ylim=c(0, 23))

Characters (in thousands)   0-5   5-10   10-15   15-20   20-25   25-30   ...   55-60   60-65
Count                        19     12       6       2       3       5   ...       0       1

Table 1.7: The counts for the binned num_char data.
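The binning rule described above can be sketched in R. This is a minimal illustration on a small made-up vector (the values below are hypothetical and are not the email50 data). It relies on the fact that both cut() and hist() use right-closed intervals by default (right = TRUE), so a value that sits exactly on a boundary such as 5,000 is counted in the lower bin.

```r
# Hypothetical character counts, chosen to include boundary values (not email50)
num_char <- c(300, 4999, 5000, 5001, 9800, 10000, 12500, 21000, 26000, 64000)

# Bin edges every 5,000 characters from 0 to 65,000 (13 bins)
breaks <- seq(0, 65000, by = 5000)

# right = TRUE gives intervals of the form (a, b], so 5000 lands in the 0-5,000 bin
bins <- cut(num_char, breaks = breaks, right = TRUE,
            include.lowest = TRUE, dig.lab = 5)
counts <- table(bins)

# hist() applies the same rule; plot = FALSE returns the counts without drawing
h <- hist(num_char, breaks = breaks, plot = FALSE)

print(counts)
print(h$counts)
```

Passing right = FALSE to either function would instead send boundary values to the upper bin, so the choice is worth stating whenever binned counts are reported.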

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common. For instance, there are many more emails with fewer than 20,000 characters than emails with at least 20,000 in the data set. The bars make it easy to see how the density of the data changes relative to the number of characters.

Histograms are especially convenient for describing the shape of the data distribution. Figure LABEL:email50NumCharHist shows that most emails have a relatively small number of characters, while fewer emails have a very large number of characters. When data trail off to the right in this way and have a longer right tail, the shape is said to be right skewed. (Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed to the positive end.)

Data sets with the reverse characteristic – a long, thin tail to the left – are said to be left skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal trailing off in both directions are called symmetric.



Long tails to identify skew

When data trail off in one direction, the distribution has a long tail. If a distribution has a long left tail, it is left skewed. If a distribution has a long right tail, it is right skewed.
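The tail-based description of skew can be complemented by a rough numeric check. The sketch below is an addition rather than part of this section's definitions: it computes the sample skewness (the average cubed standardised deviation) on simulated data. The function name sample_skewness, the seed and the simulated distributions are all assumptions made for illustration.

```r
# Sample skewness: the mean of the cubed standardised deviations.
# Positive values suggest a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  z <- centred / sqrt(mean(centred^2))
  mean(z^3)
}

set.seed(1)                           # arbitrary seed, for reproducibility only
right_skewed <- rchisq(1000, df = 3)  # simulated data with a long right tail
left_skewed  <- -right_skewed         # mirror image: long left tail
symmetric    <- rnorm(1000)           # roughly equal trailing off on both sides

skews <- c(right = sample_skewness(right_skewed),
           left = sample_skewness(left_skewed),
           symmetric = sample_skewness(symmetric))
print(round(skews, 2))
```

A single number like this does not replace looking at the histogram, but its sign usually agrees with the direction of the longer tail.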

Example 1.6.8

Take a look at the dot plots in Figures LABEL:emailCharactersDotPlot and LABEL:emailCharactersDotPlotStacked. Can you see the skew in the data? Is it easier to see the skew in this histogram (Figure LABEL:email50NumCharHist) or the dot plots?

Answer. The skew is visible in all three plots, though the flat dot plot is the least useful. The stacked dot plot and histogram are helpful visualizations for identifying skew.

Example 1.6.9

Besides the mean (since it was labelled), what can you see in the dot plots that you cannot see in the histogram?

Answer. Character counts for individual emails.

In addition to looking at whether a distribution is skewed or symmetric, histograms can be used to identify modes. A mode is represented by a prominent peak in the distribution. Another definition of mode, which is not typically used in statistics, is the value with the most occurrences. Especially in continuous data, it is common to have no observations with the same value in a data set, which makes this definition useless for many real data sets. There is only one prominent peak in the histogram of num_char.
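The contrast between the two definitions of mode can be seen in a short sketch. The data here are simulated (a hypothetical continuous measurement, not num_char), and the seed and parameters are arbitrary choices for illustration.

```r
set.seed(2)                        # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 10, sd = 3)  # hypothetical continuous measurements

# "Most frequent value" definition: tabulate how often each exact value occurs.
# With continuous data, repeated values are very unlikely, so no value stands out.
max_repeats <- max(table(x))

# Peak-based view: bin the data and locate the tallest bar of the histogram
h <- hist(x, plot = FALSE)
tallest <- which.max(h$counts)
tallest_bin <- c(h$breaks[tallest], h$breaks[tallest + 1])

print(max_repeats)   # largest number of times any single value occurs
print(tallest_bin)   # lower and upper edge of the bin with the highest count
```

The tallest bin depends on the choice of breaks, which is one reason the peak-based notion of a mode is a description of shape rather than an exact value.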

Figure LABEL:singleBiMultiModalPlots shows histograms that have one, two, or three prominent peaks. Such distributions are called unimodal, bimodal, and multimodal, respectively. Any distribution with more than two prominent peaks is called multimodal. Notice that the unimodal distribution has one prominent peak and a second, less prominent peak that is not counted, since it differs from its neighbouring bins by only a few observations.

R> set.seed(51)
R> x1 <- rchisq(65, 6)
R> x2 <- c(rchisq(22, 5.8), rnorm(40, 16.5, 2))
R> x3 <- c(rchisq(20, 3), rnorm(35, 12), rnorm(42, 18, 1.5))
R> hist(x1); hist(x2); hist(x3)
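The idea of ignoring a peak that differs from its neighbours by only a few observations can be made concrete in more than one way. The sketch below is one possible formalisation, not the course's definition: it measures how far each local maximum rises above the lowest point separating it from a taller bar (or from the edge of the histogram) and counts the peaks that rise by at least min_prominence. The function names, the threshold of 3 and the example counts are all hypothetical.

```r
# Lowest count between bin i and the nearest taller bin in one direction.
# step = -1 scans left, step = +1 scans right. If no taller bin exists,
# the scan runs off the histogram, where the count is taken to be zero.
side_min <- function(counts, i, step) {
  height <- counts[i]
  lowest <- height
  j <- i + step
  while (j >= 1 && j <= length(counts)) {
    if (counts[j] > height) return(lowest)
    lowest <- min(lowest, counts[j])
    j <- j + step
  }
  0
}

# Count local maxima whose rise above the surrounding valleys is large enough
count_prominent_peaks <- function(counts, min_prominence = 3) {
  padded <- c(0, counts, 0)  # counts outside the histogram are zero
  peaks <- 0
  for (i in seq_along(counts)) {
    # strictly above the left neighbour, so a flat top is counted once
    is_local_max <- counts[i] > padded[i] && counts[i] >= padded[i + 2]
    if (!is_local_max) next
    prominence <- counts[i] - max(side_min(counts, i, -1), side_min(counts, i, +1))
    if (prominence >= min_prominence) peaks <- peaks + 1
  }
  peaks
}

# Hypothetical bin counts with one, two and three prominent peaks
one_peak    <- c(2, 8, 15, 14, 9, 6, 7, 3, 1)  # the bump at 7 rises by only 1
two_peaks   <- c(3, 10, 6, 2, 4, 12, 9, 1)
three_peaks <- c(1, 9, 3, 11, 2, 8, 1)

print(c(count_prominent_peaks(one_peak),
        count_prominent_peaks(two_peaks),
        count_prominent_peaks(three_peaks)))
```

The same function could be applied to the counts of a simulated sample, for example count_prominent_peaks(hist(x1, plot = FALSE)$counts), keeping in mind that the answer changes with the bin width and with the threshold.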

Example 1.6.10

Figure LABEL:email50NumCharHist reveals only one prominent mode in the number of characters. Is the distribution unimodal, bimodal, or multimodal?

Answer. Unimodal. Remember that uni stands for 1 (think unicycles). Similarly, bi stands for 2 (think bicycles). (We’re hoping a multicycle will be invented to complete this analogy.)

Example 1.6.11

Height measurements of young students and adult teachers at a primary school were taken. How many modes would you anticipate in this height data set?

Answer. There might be two height groups visible in the data set: one of the students and one of the adults. That is, the data are probably bimodal.
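A quick simulation can illustrate this answer. The group sizes, means and standard deviations below are invented for illustration (they are not measurements from any school), and heights are assumed to be roughly normal within each group.

```r
set.seed(3)  # arbitrary seed, for reproducibility only

# Hypothetical heights in centimetres
students <- rnorm(200, mean = 130, sd = 8)  # young students
teachers <- rnorm(20, mean = 170, sd = 7)   # adult teachers
heights <- c(students, teachers)

# Bin the combined data; plot = FALSE returns the histogram object without drawing
h <- hist(heights, breaks = 20, plot = FALSE)
print(h$counts)

# Draw the histogram: a large peak for students, a smaller one for teachers
plot(h, main = "Simulated heights", xlab = "Height (cm)")
```

Because there are far fewer teachers than students in this sketch, the second peak is much lower than the first, which shows why a second mode can be easy to miss in real data.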



TIP: Looking for modes

Looking for modes isn’t about finding a clear and correct answer about the number of modes in a distribution, which is why prominent is not rigorously defined in this course. The important part of this examination is to better understand your data and how it might be structured.