Chapter 3: The Normal Distribution
Welcome to the online content for Chapter 3!
As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.
As before, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them. And be careful to run these boxes in order if later boxes depend on you having done other things previously.
Tail percentages
It’s easy to use R to find the percentages of people in the tails of any Normal curve. For example, suppose that we have our usual Normal distribution, with a mean of 170 cm and a standard deviation of 6 cm. We might want to know what percentage of people are shorter than 170 cm. The answer should be obvious, because if 170 cm is the mean of a Normal distribution, then it’s also the median of the distribution. (Remember that for any Normal curve the mean, median and mode are all equal.) This means that 50% of people must be shorter than 170 cm, and 50% must be taller than 170 cm.
We can use the function pnorm to find probabilities (or percentages) inside Normal curves.
Try this:
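Here is a minimal sketch of that call, assuming the mean of 170 cm and standard deviation of 6 cm described above (the variable name p_below_mean is just for illustration):

```r
# Proportion of people shorter than 170 cm in a Normal distribution
# with mean 170 and standard deviation 6
p_below_mean <- pnorm(170, mean = 170, sd = 6)
p_below_mean  # 0.5
```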
The first number given to the pnorm function is the value that we’re interested in. The other two numbers are the mean and the standard deviation, which tell R which Normal curve we want to know about.
The output of 0.5 is equivalent to 50%. It’s telling us that 50% of the people are shorter than 170 cm.
When you get very familiar with certain functions, you begin to get used to the order in which parameters are given, and then you might choose to omit the “mean=” and “sd=” parts. You could just write:
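A sketch of the shorter form, which relies on pnorm taking the value first, then the mean, then the standard deviation:

```r
# Same calculation with unnamed arguments: value, then mean, then sd
p_positional <- pnorm(170, 170, 6)
p_positional  # 0.5
```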
But the advantage of including “mean=” and “sd=” is that you’re then free to supply those parameters in whatever order you like. For example, this will work:
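For instance, a call with the named parameters swapped around might look like this:

```r
# Named arguments can be given in any order
p_named <- pnorm(170, sd = 6, mean = 170)
p_named  # 0.5
```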
By including “sd=” and “mean=”, you’ve made it perfectly clear what you intend. But if you just put pnorm(170, 6, 170) then R would assume that you meant a mean of 6 and a standard deviation of 170, and you’d get a very different result!
So, it’s safer to include the “mean=” and “sd=” parts unless you’re absolutely sure that you remember what order R will assume if you don’t!
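If you want to see the difference for yourself, a quick comparison along these lines should show it (the variable names are illustrative only):

```r
# What we intended: mean 170, sd 6
intended <- pnorm(170, mean = 170, sd = 6)

# Unnamed arguments in the wrong order: R reads this as mean 6, sd 170
swapped <- pnorm(170, 6, 170)

intended  # 0.5
swapped   # noticeably larger than 0.5
```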
Now, if we change the first 170 cm to 168 cm, predict what will happen to the answer.
Now try it:
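A sketch of the modified call, keeping the same mean and standard deviation as before:

```r
# Proportion of people shorter than 168 cm (mean 170, sd 6)
p_168 <- pnorm(168, mean = 170, sd = 6)
p_168  # about 0.369
```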
The answer is smaller now, because 168 cm is a lower height, so fewer people will be shorter than 168 cm than are shorter than 170 cm. The output tells us that about 36.9% of people are shorter than 168 cm. It follows that about 13.1% of people must have heights between 168 cm and 170 cm, because 36.9% and 13.1% add up to 50%.
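You could also ask R to do that subtraction directly. This is only a sketch, and the name p_between is made up for illustration:

```r
# Proportion of people with heights between 168 cm and 170 cm:
# everyone shorter than 170 cm, minus everyone shorter than 168 cm
p_between <- pnorm(170, mean = 170, sd = 6) - pnorm(168, mean = 170, sd = 6)
p_between  # about 0.131
```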
Let’s try 164 cm, still with the same Normal distribution:
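Assuming the same mean of 170 and standard deviation of 6, the call would look something like this:

```r
# Proportion of people shorter than 164 cm (mean 170, sd 6)
p_164 <- pnorm(164, mean = 170, sd = 6)
p_164  # 0.1586553
```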
The percentage has reduced again, because 164 cm is an even shorter height.
Because 164 cm is 6 cm (1 standard deviation) below the mean of 170 cm, the output here is telling us the percentage of people whose heights are more than 1 standard deviation below the mean.
I said in the chapter that about 68% of people are within 1 standard deviation of the mean. We can get that value by working out:
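A sketch of that calculation, using 176 cm and 164 cm as the heights 1 standard deviation either side of the mean:

```r
# Proportion shorter than 176 cm, minus proportion shorter than 164 cm,
# leaves the proportion within 1 standard deviation of the mean
p_within_1sd <- pnorm(176, mean = 170, sd = 6) - pnorm(164, mean = 170, sd = 6)
p_within_1sd  # 0.6826895
```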
This is about 68%.
This looks complicated, but it makes sense. The height of 176 cm is 1 standard deviation above the mean, and so the first part of the calculation works out the percentage of people who are shorter than 176 cm. Then, R subtracts the value that we worked out above, which is the percentage of people who are shorter than 164 cm. The final answer will be the percentage of people who are shorter than 1 standard deviation above the mean but taller than 1 standard deviation below the mean. This is what we mean by the percentage of people within 1 standard deviation of the mean.
If we try a different Normal distribution, with different values of the mean and standard deviation, we’ll still get the same answers as long as we choose values that are 1 standard deviation away from the new mean. For example, the code below is still working out the percentage of people who are more than 1 standard deviation below the mean, and so we will still get the same 0.1586553 value that we got above.
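Using the mean of 85 and standard deviation of 5 mentioned just below, a call of this kind would be:

```r
# 80 is 1 standard deviation (5) below the new mean (85)
p_new_curve <- pnorm(80, mean = 85, sd = 5)
p_new_curve  # 0.1586553
```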
The standard deviation is now 5, and the mean is 85, so 80 is still 1 standard deviation below the mean, and so the percentage of people below this will still be 15.9%.
Experiment with changing the numbers in the code below and predict each time whether the output value will increase or decrease.
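As a starting point, here is one possible sketch. It uses the fact that pnorm accepts several values at once, so you can compare a few in one go. The particular values chosen are arbitrary:

```r
# Try several values against the same Normal curve (mean 85, sd 5)
values <- c(75, 80, 85, 90)
p_values <- pnorm(values, mean = 85, sd = 5)

# Larger values leave more of the curve below them,
# so the proportions increase from left to right
p_values
```

Try changing the mean or the sd as well, and predict what will happen before you run it.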
Box plots
For this chapter, I also want to show you how easy it is to draw box plots in R.
Let’s read in the same data set that we used in the previous two chapters.
If we want a box plot of the heights, all we have to do is:
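The data file isn’t named in this section, so the sketch below makes up some stand-in heights with rnorm purely for illustration. With the real data, you would pass your own column of heights to boxplot instead:

```r
# Stand-in data ONLY: 100 simulated heights, not the book's data set
set.seed(1)
heights <- rnorm(100, mean = 170, sd = 6)

# Draw the box plot; boxplot() also returns the statistics it plotted
bp <- boxplot(heights, ylab = "Height (cm)")
```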
It’s very useful to look at box plots of your data before doing any analyses. They give you a sense of what the values are and help you to notice outliers.
Quantiles and quartiles
Remember that the quartiles of our data are shown in a box plot by the positions of the horizontal lines. You’ll recall that the 2nd quartile is the same as the median, because it’s the value that’s half way (or two quarters of the way) through the data.
We can work out the quartiles using the quantile function:
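Here is a sketch using simulated stand-in heights (an assumption, since the real data set isn’t shown here), so the numbers it prints will not match the ones quoted in this section:

```r
# Stand-in data ONLY: simulated heights, not the book's data set
set.seed(1)
heights <- rnorm(100, mean = 170, sd = 6)

# With no further arguments, quantile() returns the
# 0%, 25%, 50%, 75% and 100% quantiles
q <- quantile(heights)
q
```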
Quartiles are quantiles in which the data have been split into four equally-populated groups. In R, quartiles are the default for the quantile function, because we so often use quartiles. The 25%, 50% and 75% quantiles given in the output are the first, second and third quartiles. The 0% quantile is the minimum value in the data set and the 100% quantile is the maximum value in the data set.
If we just wanted one of the quartiles, say the 3rd one, we could do:
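A sketch of asking for a single quantile, again on simulated stand-in heights rather than the real data:

```r
# Stand-in data ONLY: simulated heights, not the book's data set
set.seed(1)
heights <- rnorm(100, mean = 170, sd = 6)

# The second argument is the proportion we want: 0.75 for the 3rd quartile
q3 <- quantile(heights, 0.75)
q3
```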
And we just get the 173.3 cm value that we obtained for the 75% quantile above.
We can check that the 2nd quartile (the 50% quantile) is the same as the median and that the 0% and 100% quantiles are the minimum and maximum values respectively.
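One way to run that check, sketched on simulated stand-in heights (so the values will differ from those in the text, but they should still agree with each other):

```r
# Stand-in data ONLY: simulated heights, not the book's data set
set.seed(1)
heights <- rnorm(100, mean = 170, sd = 6)

q <- quantile(heights)

median(heights)  # compare with the 50% entry of q
min(heights)     # compare with the 0% entry of q
max(heights)     # compare with the 100% entry of q
q
```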
All three values here match those we obtained above from the quantile function.
Although quartiles are the default quantile in R, the quantile function will give you any kind of quantile that you want. For example, if you wanted to know the height that 12% of people are shorter than, you would do:
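As a sketch, with simulated stand-in heights in place of the real data set (so the result will not be the value quoted below):

```r
# Stand-in data ONLY: simulated heights, not the book's data set
set.seed(1)
heights <- rnorm(100, mean = 170, sd = 6)

# The height below which 12% of the values fall
q12 <- quantile(heights, 0.12)
q12

# Check: the proportion of heights below that value should be close to 0.12
mean(heights < q12)
```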
because .12 is the same as 12%. So, 12% of the people in our data set are shorter than 159.8 cm.