Chapter 6: The \(t\) Distribution
Welcome to the online content for Chapter 6!
As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.
As always, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them. And be careful to run these boxes in order if later boxes depend on you having done other things previously.
Tails of Normal curves (again)
In the previous few chapters, we’ve used the `pnorm` function to find the percentages inside Normal curves. For example, we worked out things like:
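The code box here isn’t reproduced on this page, but with a standard deviation of 6 cm and a mean of 170 cm (the mean is my assumption, taken from the null hypothesis used later in this chapter), the call would look something like:

```r
# Percentage of people shorter than 176 cm, i.e. 1 SD above the mean
# (a mean of 170 cm is assumed from the chapter's running example)
pnorm(176, mean = 170, sd = 6)  # about 0.841, i.e. roughly 84%
```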
In this case, because the standard deviation is 6 cm, we’re finding the percentage of people who are shorter than 1 standard deviation above the mean. The output tells us that about 84% of people are shorter than 1 standard deviation above the mean.
We’d get the same result if we did:
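Sketching that with the same assumed mean of 170 cm, but a standard deviation of 10 cm:

```r
# 180 cm is 1 SD above a mean of 170 cm when the SD is 10 cm
pnorm(180, mean = 170, sd = 10)  # about 0.841 again
```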
Being 10 cm above the mean, when the standard deviation is 10 cm, is exactly as likely as being 6 cm above the mean when the standard deviation is 6 cm.
Because of this, it’s often easier to think in terms of the \(z\) statistic, which tells us the number of standard deviations above the mean that a value is. And the `pnorm` function will assume that that’s what you’re giving it, if you don’t state the mean and standard deviation:
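A minimal sketch of that call:

```r
pnorm(1)  # about 0.841: the area below a z score of 1
```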
We can think of this as being the same as:
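In other words, something like:

```r
pnorm(1, mean = 0, sd = 1)  # identical to plain pnorm(1)
```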
R takes a mean of zero and a standard deviation of 1 as its default setting for the `pnorm` function. You’ll recall that this is the standard Normal distribution.
A \(\boldsymbol z\) score of -1 means 1 standard deviation below the mean.
The code below finds the number of people shorter than 1 standard deviation below the mean:
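Presumably along the lines of:

```r
pnorm(-1)  # about 0.159: roughly 16% are below a z score of -1
```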
This is a much smaller percentage, because being shorter than a lower height is always going to be less likely than being shorter than a taller height. In particular, it’s less than 50%, because we know that the percentage of people shorter than the mean must be 50%.
We can check this by doing:
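For example:

```r
pnorm(0)  # exactly 0.5
```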
This confirms that 50% of people have a height shorter than a \(z\) score of 0, corresponding to the mean.
The code below finds us the percentage of people within 1 standard deviation of the mean:
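One way to write that is to subtract the two tail areas we’ve just seen:

```r
pnorm(1) - pnorm(-1)  # about 0.683
```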
We’ve seen before that this is about 68%.
Tails of \(\boldsymbol t\) curves
There’s a corresponding `pt` function to find the percentages inside \(t\) curves, and it works in exactly the same way as `pnorm`, except that not all \(t\) curves are the same shape, so we have to specify the number of degrees of freedom, as well as the \(t\) value.
In the chapter, we had a sample of size 10 that had a mean of 173.4 cm and a standard deviation of 6.3 cm. Our null hypothesis was that the population mean was 170 cm.
To find the \(t\) statistic, we have to find how many standard errors from the hypothesised mean our sample mean is.
We work out how far our sample mean is from 170 cm.
We estimate the standard error from the standard deviation by working out \(6.3/\sqrt{10}\), just as we did in the previous chapter.
In the chapter, I referred to this value as being 2 cm, and we’ll use that rounded value here.
So, to find the \(t\) statistic, we just have to work out how large 3.4 is, in terms of 2s.
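Putting those three steps into R (the rounded standard error of 2 cm is the chapter’s value):

```r
173.4 - 170      # sample mean minus hypothesised mean: 3.4 cm
6.3 / sqrt(10)   # estimated standard error: about 1.99, rounded to 2 cm
3.4 / 2          # the t statistic: 1.7
```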
So, our \(t\) statistic, on 9 degrees of freedom (one fewer than the sample size of 10), is 1.70, which we write as:
\[ t(9)=1.70. \]
To get the \(p\) value that goes with this, we need to double the area inside the \(t\) curve that’s to the right of 1.70 (remembering that \(t\) curves are always symmetrical), because we’re doing a 2-tailed test.
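In R, that doubling looks like:

```r
# area to the right of 1.70 under the t curve with 9 degrees of
# freedom, doubled for the 2-tailed test
2 * (1 - pt(1.70, df = 9))  # about 0.123
```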
This tells us that the \(p\) value is .123, so we can write:
\[ t(9)=1.70,p=.123. \]
Because we’ve used rounded values, this doesn’t exactly match what I presented in the chapter, but it’s close. Let’s see how to do it more accurately.
Running a 1-sample \(\boldsymbol t\) test
If what we just did seems like a lot of work, there’s actually a function in R called `t.test` which does all of this for you. I wanted to show you above how to work with the \(t\) distribution directly, but in practice you’d just use `t.test`. So, let’s do it again, using `t.test` this time, and now we should get exactly the values that I mentioned in the chapter, because we won’t be rounding the numbers.
First, let’s read in the data of the 10 people’s heights.
Let’s check that the mean is 173.4 cm.
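The data file itself isn’t included in this page’s text, so the sketch below uses a made-up set of ten heights standing in for the chapter’s data box (chosen only so that the code runs and the mean comes out at 173.4 cm; they are not the book’s data):

```r
# Placeholder for the chapter's data box -- these ten heights are
# made up for illustration, NOT the book's actual data
people <- data.frame(height = c(168, 170, 171, 172, 173, 174, 175, 176, 177, 178))
mean(people$height)  # 173.4
```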
That matches. And now we can run a 1-sample \(t\) test, with our null hypothesis being that the population mean (called mu) is 170 cm. The code is:
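The call itself is `t.test` with `mu = 170`; here it’s sketched on the made-up placeholder heights (redefined so this box runs on its own), so the printed \(t\) and \(p\) values won’t match the chapter’s:

```r
# Made-up placeholder heights, NOT the book's data
people <- data.frame(height = c(168, 170, 171, 172, 173, 174, 175, 176, 177, 178))
t.test(people$height, mu = 170)
```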
This is the most output that we’ve ever got from an R function, so let’s take time to unpick it.
First, there’s a heading saying that we’ve done a ‘One Sample t-test’ using the variable `people$height`.
Then we get a line giving us the \(t\) statistic, the number of degrees of freedom and the \(p\) value, which we would normally write out as:
\[ t(9)=1.67,p=.129. \]
This time, it matches exactly what we had in the chapter.
We also get a 95% confidence interval, which goes from 168.8 cm to 177.9 cm. The fact that 170 cm lies within this confidence interval corresponds to the fact that the \(p\) value for our 170 cm null hypothesis is greater than 5%.
Finally, at the bottom, we’re given the sample mean value of 173.354 cm.
Running 2-sample \(\boldsymbol t\) tests
We saw in the chapter that there are two kinds of 2-sample \(t\) test: independent-samples and paired-samples. I’ll show you how to do both.
First, we’ll do an independent-samples \(\boldsymbol t\) test. For this, we need two sets of height data, which means that our dataframe will contain an extra variable, telling us whether each person is in the East or in the West. Let’s read in the data that we used in the chapter.
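The data box isn’t reproduced here, so below is a made-up stand-in with the same three-column shape (the column names `person`, `height` and `location` are my assumption, and the values are invented for illustration):

```r
# Made-up stand-in for the chapter's two-group data, NOT the book's values
twogroups <- data.frame(
  person   = 1:8,
  height   = c(170, 172, 174, 176, 171, 173, 175, 177),
  location = c(rep("West", 4), rep("East", 4))
)
twogroups
```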
For the first time, we have three variables in our dataframe, corresponding to three columns.
Each person has a reference number, and a height, but now also a location: West or East. We need to use the categorical ‘location’ variable to enable us to compare the mean height for the Westerners with the mean height for the Easterners.
We need to split the variable `twogroups$height` into the Westerner heights and the Easterner heights. To do this we can use the `subset` function in R. (Note that advanced users of R tend to avoid the `subset` function, but I think it’s justified here for simplicity.)
Try this:
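A sketch, using the made-up `twogroups` stand-in (redefined here so the box runs on its own):

```r
twogroups <- data.frame(person = 1:8,  # made-up stand-in data
                        height = c(170, 172, 174, 176, 171, 173, 175, 177),
                        location = c(rep("West", 4), rep("East", 4)))
subset(twogroups, location == "West")  # only the rows whose location is 'West'
```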
This gives us all of the twogroups people who have a location of ‘West’.
(Note that we use two adjacent equals signs, `==`, to say ‘if it’s equal to’.)
Interpret what this code is doing:
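The code being described is along these lines, sketched on the made-up stand-in data (extracting the height column with `$height` is one way of writing it):

```r
twogroups <- data.frame(person = 1:8,  # made-up stand-in data
                        height = c(170, 172, 174, 176, 171, 173, 175, 177),
                        location = c(rep("West", 4), rep("East", 4)))
Westerner.heights <- subset(twogroups, location == "West")$height
mean(Westerner.heights)
```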
The first line of code creates a new variable, called `Westerner.heights`. (Remember that the full stop is just there to replace a space, which isn’t allowed in variable names.) The second line works out the mean height in the West.
Use the code box below to adapt the code above to find the mean height of the Easterners.
To run the \(t\) test, we just need to do:
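With the Easterners as the first argument and the Westerners as the second (the order that gives the East-minus-West difference), the call would look like this, sketched on the made-up stand-in data:

```r
twogroups <- data.frame(person = 1:8,  # made-up stand-in data
                        height = c(170, 172, 174, 176, 171, 173, 175, 177),
                        location = c(rep("West", 4), rep("East", 4)))
Westerner.heights <- subset(twogroups, location == "West")$height
t.test(subset(twogroups, location == "East")$height, Westerner.heights)
```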
(You could simplify this to `t.test(Easterner.heights, Westerner.heights)` instead, if you defined `Easterner.heights` in the empty code box above.)
This time, instead of a variable and a fixed value (e.g. `mu = 170`), the two arguments that we put into `t.test` are both variables: one is the heights of the Easterners and the other is the heights of the Westerners.
We have a \(p\) value of .479, which is greater than 5%, so we don’t reject the null hypothesis that these two groups are random samples from the same population. The confidence interval of the difference between the East mean and the West mean goes from -2.2 cm to 4.6 cm. This corresponds to anything between the Easterners being 4.6 cm taller than the Westerners to the Westerners being 2.2 cm taller than the Easterners (because of the negative sign).
If you put the two variables into the `t.test` function in the opposite order (i.e. Westerners first, followed by a comma, and then the Easterners), you’d get almost the same output, but the confidence interval would go from -4.6 cm to 2.2 cm, instead of from -2.2 cm to 4.6 cm. That’s because it would be telling you about the difference between the West mean and the East mean, rather than the difference between the East mean and the West mean. It always subtracts the mean of the second variable from the mean of the first.
We saw in the chapter that if the same data had been generated in a paired fashion, from the same participants, then we would be able to reject the null hypothesis of no difference. Let’s confirm that by running a paired-samples \(t\) test.
Let’s read in the data again, but differently labelled.
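Again, the data box isn’t reproduced, so here is a made-up stand-in with a before/after column (the column name `time` and the labels ‘Before’ and ‘After’ are my assumption, and the values are invented):

```r
# Made-up stand-in for the chapter's paired data, NOT the book's values;
# the column name 'time' is an assumption
twotimes <- data.frame(
  person = rep(1:4, times = 2),
  height = c(170, 172, 174, 176, 171.2, 173.1, 174.9, 177.3),
  time   = rep(c("Before", "After"), each = 4)
)
twotimes
```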
This time, the height values are the same as before, but there are only 20 people. Each person has a height ‘before’ going into space and ‘after’ going into space. Notice this time that the person numbers go 1 to 20, and then 1 to 20 again, rather than 21 to 40.
We can subset the twotimes data, similarly to the subsetting that we did above:
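A sketch on the stand-in data (redefined so this box runs on its own):

```r
twotimes <- data.frame(person = rep(1:4, times = 2),  # made-up stand-in
                       height = c(170, 172, 174, 176, 171.2, 173.1, 174.9, 177.3),
                       time   = rep(c("Before", "After"), each = 4))
before.heights <- subset(twotimes, time == "Before")$height
after.heights  <- subset(twotimes, time == "After")$height
```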
This pulls out everybody’s ‘Before’ heights and ‘After’ heights separately.
To run the paired \(t\) test, we just need to do:
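Something like the following, again sketched on the made-up stand-in data:

```r
twotimes <- data.frame(person = rep(1:4, times = 2),  # made-up stand-in
                       height = c(170, 172, 174, 176, 171.2, 173.1, 174.9, 177.3),
                       time   = rep(c("Before", "After"), each = 4))
before.heights <- subset(twotimes, time == "Before")$height
after.heights  <- subset(twotimes, time == "After")$height
t.test(after.heights, before.heights, paired = TRUE)
```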
We definitely want to put `after.heights` first and `before.heights` second, because we want to compare ‘after’ with ‘before’, rather than ‘before’ with ‘after’!
The argument `paired=TRUE` is telling R that we want it to treat this data as paired data and run a paired \(t\) test. To confirm this, the R output begins ‘Paired t-test’.
This time, we get a \(p\) value of .002, which is less than 5%, so we do reject the null hypothesis that there’s no difference between the Before and After heights. The confidence interval of the difference between the After mean and the Before mean goes from 0.5 cm to 1.9 cm, which doesn’t include zero. A zero change in height is rejected, as is any change in height less than 0.5 cm.