Chapter 7: Analysis of Variance (ANOVA)

Author

Colin Foster

Welcome to the online content for Chapter 7!

As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.

Click the ‘Run Code’ buttons below to execute the R code, waiting until each button says ‘Run Code’ before you press it. Be careful to run the boxes in order, because later boxes may depend on things that you did in earlier ones.

Variance ratio test

In this chapter, we began with two data sets with equal means but different variances. Let’s read in those data sets and call the dataframes ‘first’ and ‘second’.

Let’s check that their means are equal:

Those values are very close to equal.

You might have noticed that there were 25 people in the first data set but only 20 in the second one. We can confirm that using the length function in R.

Let’s see what the variances are of each sample:

We can work out the variance ratio by dividing the first variance by the second variance:

You may recall that I contrived the variances to have a ratio of 4 to make this first example easier to present. This gives us an \(F\) statistic of 4.

Running a variance ratio test on these two samples is easy:

The output is similar to the \(t\) test output that we saw in the previous chapter.

First, R tells us ‘F test to compare two variances’, so we know that it’s running the correct test.

The ‘data:’ line then tells us the two variables containing the data that we’re comparing.

The \(F\) value of 3.9961 matches our calculation of the \(F\) ratio, and R gives us the ‘numerator’ (first variance) and ‘denominator’ (second variance) degrees of freedom, which are 1 fewer than the number of observations in each variable (\(25-1\) and \(20-1\)).

The \(p\) value matches the .003 that I stated in the text, and the confidence interval gives the range of population variance ratios that our \(F\) value of 4 wouldn’t be able to reject at the 5% level. The fact that this 95% confidence interval doesn’t include 1 (equal variances) corresponds to the \(p\) value being less than 5%.
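The steps in this section can be sketched end to end with simulated data. This is only an illustration under stated assumptions: the values are randomly generated stand-ins (not the book's data sets), and the column name `score` is hypothetical. It shows that the \(F\) statistic reported by `var.test` is simply the ratio of the two sample variances, with degrees of freedom 1 fewer than each sample size.

```r
# Simulated stand-ins for the two samples (hypothetical values, not the book's data)
set.seed(1)
first  <- data.frame(score = rnorm(25, mean = 50, sd = 10))
second <- data.frame(score = rnorm(20, mean = 50, sd = 5))

# Sample sizes, using the length function
n_first  <- length(first$score)    # 25
n_second <- length(second$score)   # 20

# Variance ratio worked out by hand
ratio <- var(first$score) / var(second$score)

# F test to compare two variances
result <- var.test(first$score, second$score)

ratio               # hand-calculated variance ratio
result$statistic    # F reported by var.test (the same number)
result$parameter    # numerator and denominator degrees of freedom
result$conf.int     # 95% confidence interval for the variance ratio
```

With real data you would replace the two simulated dataframes with the ones that you read in.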

Using ANOVA to do a \(\boldsymbol t\) test

You’ll recall that in the chapter I showed how ANOVA could replicate the results of an independent-samples \(t\) test. This was just to show the logic of how ANOVA works.

Let’s replicate that here.

First, we’ll read in the data set and call it ‘people’ again.

We can see that we have the heights of 20 people from the West and 20 people from the East. This time, I’ve numbered the Westerners 1-20 and the Easterners 1-20. I could have numbered the Easterners 21-40 - it wouldn’t have mattered, because they’re different people: Easterner 7 and Westerner 7, say, are completely different people.

Let’s run a \(t\) test like we did in Chapter 6.

We get a \(p\) value of .039, and all of the other values match those that I gave in the chapter.

Using the subset function to do this is a bit long-winded, and there’s actually an easier way to run this same \(t\) test in R.

We get exactly the same output as above.

The data=people parameter tells R which dataframe it should look in.

The height ~ location part is read as ‘height by location’ or ‘height predicted by location’, and the tilde ~ symbol tells R to split up the height values according to which level of the location variable each person has. It does exactly the same as the subsetting commands above, but involves less typing. From now on, where possible, we’ll use tildes to define our models.
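To make the comparison concrete, here is a minimal sketch of both ways of running the test. The dataframe is simulated (hypothetical values, not the book's data), and R's default settings for `t.test` are assumed. Because R orders the levels of `location` alphabetically, the formula version compares East with West, so the subset version is written in the same order.

```r
# Simulated stand-in for the 'people' dataframe (hypothetical values)
set.seed(2)
people <- data.frame(
  person   = rep(1:20, times = 2),
  location = rep(c("West", "East"), each = 20),
  height   = c(rnorm(20, mean = 172, sd = 8), rnorm(20, mean = 168, sd = 8))
)

# Long-winded version: pull out each group with subset
east <- subset(people, location == "East")$height
west <- subset(people, location == "West")$height
t_subset <- t.test(east, west)

# Formula version: 'height by location', looking inside the people dataframe
t_formula <- t.test(height ~ location, data = people)

t_subset$p.value
t_formula$p.value   # identical to the subset version
```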

To do this using ANOVA, we use the ANOVA function aov like this:

This gives us the sum of squares for location and its degrees of freedom (1, because this is 1 fewer than the 2 levels of the variable, West and East). It also gives us the same for the residuals (i.e. error), and you can see that the residual sum of squares is much larger. We’ll ignore the rest of the information for now.

From these numbers, we can calculate eta squared.

We work out the total sum of squares:

And eta squared is the model sum of squares, divided by this.

This is where the 11% comes from that I mentioned in the text.
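The eta squared calculation can also be sketched from first principles, which may help to show where the sums of squares come from. The data here are simulated (hypothetical values, not the book's), so the result will not be the 11% quoted above. The helper names `ss_model`, `ss_residual` and `ss_total` are illustrative.

```r
# Simulated stand-in for the two-location data (hypothetical values)
set.seed(3)
people <- data.frame(
  location = rep(c("West", "East"), each = 20),
  height   = c(rnorm(20, mean = 172, sd = 8), rnorm(20, mean = 167, sd = 8))
)

grand_mean  <- mean(people$height)
# ave() gives each person the mean height of their own location
group_means <- ave(people$height, people$location)

# Model (between-groups) and residual (within-groups) sums of squares
ss_model    <- sum((group_means - grand_mean)^2)
ss_residual <- sum((people$height - group_means)^2)

# Total sum of squares and eta squared
ss_total    <- ss_model + ss_residual
eta_squared <- ss_model / ss_total
eta_squared
```

These hand-calculated sums of squares are the same quantities that `aov` reports for location and for the residuals.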

If we want more details of the ANOVA, we have to ask for a summary of it:

The first two columns give us the same information as before, but now we additionally get the mean squares for each variable (in the ‘Mean Sq’ column) and the ratio of these two variances, which is the \(F\) value, 4.58. R also gives us the \(p\) value for this, in the ‘Pr(>F)’ column. This means ‘the probability of getting an \(F\) ratio larger than this’, if the null hypothesis is true. We can see that it’s the same \(p\) value of .039 that we obtained above in the \(t\) test.
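The equivalence between the two tests can be checked directly. In this sketch, which uses simulated data (hypothetical values), the \(t\) test is run with `var.equal = TRUE`, i.e. the pooled-variance version; that setting is an assumption made here, because it is the version for which \(F = t^2\) holds exactly with two groups.

```r
# Simulated stand-in for the two-location data (hypothetical values)
set.seed(4)
people <- data.frame(
  location = rep(c("West", "East"), each = 20),
  height   = c(rnorm(20, mean = 172, sd = 8), rnorm(20, mean = 167, sd = 8))
)

# ANOVA: pull the F value and p value out of the summary table
model       <- aov(height ~ location, data = people)
anova_table <- summary(model)[[1]]
f_value     <- anova_table[["F value"]][1]
p_anova     <- anova_table[["Pr(>F)"]][1]

# Pooled-variance t test on the same data (assumption: var.equal = TRUE)
t_result <- t.test(height ~ location, data = people, var.equal = TRUE)

f_value
unname(t_result$statistic)^2   # t squared equals F
p_anova
t_result$p.value               # same p value
```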

1-way ANOVA

Doing this for more than two locations (West, East, Island) is no more difficult.

Let’s read in the data that I used in the chapter for this, and I’ll call it ‘people’ again. This will overwrite the previous dataframe called ‘people’.

This time there are 60 people altogether, 20 in each of the three locations.

We can find the mean height in each location in the usual way:

Those values match the ones in the chapter.

Let’s run the ANOVA:

The values all match the ones in the text.

We can work out the total sum of squares:

and the eta squared value:

So, eta squared is 10%.
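The whole 1-way analysis can be sketched as follows. The data are simulated (hypothetical values, not the book's), and numbering the people 1 to 60 is an assumption, so the resulting eta squared will differ from the 10% above.

```r
# Simulated stand-in: 20 people in each of three locations (hypothetical values)
set.seed(5)
people <- data.frame(
  person   = 1:60,
  location = rep(c("West", "East", "Island"), each = 20),
  height   = c(rnorm(20, mean = 172, sd = 8),
               rnorm(20, mean = 168, sd = 8),
               rnorm(20, mean = 166, sd = 8))
)

# Mean height in each location
group_means <- tapply(people$height, people$location, mean)
group_means

# 1-way ANOVA
model       <- aov(height ~ location, data = people)
anova_table <- summary(model)[[1]]
anova_table

# Total sum of squares and eta squared
ss          <- anova_table[["Sum Sq"]]   # order: location, residuals
ss_total    <- sum(ss)
eta_squared <- ss[1] / ss_total
eta_squared
```

With three levels, location has 2 degrees of freedom and the residuals have \(60-3=57\).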

Repeated-measures ANOVA

We can use the same data values to run a repeated-measures ANOVA, if we assume that the data were collected differently.

This is a good opportunity to show how flexible R is if you want to make changes to your data set.

We can use the names function to change our variable name ‘location’ to ‘time’. Run this code to see what the names function does:

The first line of code asks for the variable names of all of the variables in the dataframe ‘people’.

The second line of code asks for the second variable name, which is ‘location’. Square brackets after a variable pull out a particular value of that variable.

Try this:

The first line of code assigns the name ‘time’ to the second variable name. You can see that the title for the second variable has changed from ‘location’ to ‘time’.

We can do something similar to change all of the ‘West’ values to ‘Before’.

Here, the square brackets don’t contain a number, but are working like the subset function to pick out all of the people in the West.
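The renaming steps described above can be sketched on a small toy dataframe. The six rows here are made-up illustrative values, and `stringsAsFactors = FALSE` is set so that the location labels are stored as plain text that can be overwritten.

```r
# Toy dataframe (made-up values, for illustration only)
people <- data.frame(
  person   = 1:6,
  location = rep(c("West", "East", "Island"), each = 2),
  height   = c(170, 175, 168, 172, 165, 169),
  stringsAsFactors = FALSE
)

names(people)       # all of the variable names
names(people)[2]    # the second variable name, "location"

# Assign a new name to the second variable
names(people)[2] <- "time"
names(people)

# Logical condition inside square brackets: pick out the 'West' rows and overwrite them
people$time[people$time == "West"] <- "Before"
people
```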

Type or paste code into the empty code box below to change ‘East’ into ‘After’ and ‘Island’ into ‘Later’.

We’re now thinking of our data as coming from 20 people, whose heights were each observed three times.

To run the repeated-measures ANOVA the code is:

To make this analysis repeated-measures, apart from changing the location variable to the time variable and renaming its levels, we’ve added an extra term, Error(factor(person)/time), which tells R that we want it to block each person over time.

The output first shows the between-participants sum of squares, which is 1455. This is removed from the analysis, so, to find partial eta squared, we work out:

All of these values match those discussed in the chapter.
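For readers who want to try the repeated-measures analysis on their own data, here is a minimal sketch using simulated values (hypothetical, not the book's data, so the sums of squares will differ from those above). It assumes that the summary of the fitted model lists the between-participants stratum first and the person-by-time stratum second, and takes partial eta squared as the time sum of squares divided by the time plus residual sums of squares within that second stratum.

```r
# Simulated stand-in: 20 people each measured three times (hypothetical values)
set.seed(7)
person_effect <- rnorm(20, mean = 0, sd = 8)   # stable differences between people
people <- data.frame(
  person = rep(1:20, times = 3),
  time   = rep(c("Before", "After", "Later"), each = 20),
  height = 170 + rep(person_effect, times = 3) +
           rep(c(0, 2, 3), each = 20) + rnorm(60, mean = 0, sd = 2)
)

# Repeated-measures ANOVA, blocking each person over time
model  <- aov(height ~ time + Error(factor(person) / time), data = people)
strata <- summary(model)
strata

# Second stratum (person by time): rows are time, then residuals
within_table <- strata[[2]][[1]]
ss           <- within_table[["Sum Sq"]]

# Partial eta squared: the between-participants sum of squares is left out
partial_eta_squared <- ss[1] / (ss[1] + ss[2])
partial_eta_squared
```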

2-way ANOVA

Finally, we can run the 2-way ANOVA, with location and age as our two factors.

Let’s read in the data for this:

We can see that there are 40 people, 20 from the West and 20 from the East. But this time, each person has not just a person number, a height and a location, but also an age (either ‘Old’ or ‘Young’).

To run a 2-way ANOVA without an interaction, the code is:

We get a significant effect for location, and the partial eta squared is:

Notice that, for partial eta squared, the number that we divide by doesn’t include the 0.6 sum of squares from age: we divide only by the location sum of squares plus the residual sum of squares. (Because 0.6 is so much smaller than the other numbers, this makes virtually no difference in this case.)
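The difference between eta squared and partial eta squared in the additive model can be sketched like this. The data are simulated (hypothetical values), and the balanced design of 10 people in each location-by-age cell is an assumption made for the illustration.

```r
# Simulated stand-in: 40 people, two locations, two age groups (hypothetical values)
set.seed(8)
people <- data.frame(
  person   = rep(1:20, times = 2),
  location = rep(c("West", "East"), each = 20),
  age      = rep(rep(c("Old", "Young"), each = 10), times = 2),
  height   = rnorm(40, mean = 170, sd = 8)
)
# Build in a location effect for the illustration
people$height <- people$height + ifelse(people$location == "West", 5, 0)

# 2-way ANOVA without an interaction
model       <- aov(height ~ location + age, data = people)
anova_table <- summary(model)[[1]]
anova_table

ss <- anova_table[["Sum Sq"]]   # order: location, age, residuals

# Partial eta squared for location leaves the age sum of squares out of the divisor
partial_eta_location <- ss[1] / (ss[1] + ss[3])
# Ordinary eta squared divides by the total, for comparison
eta_location <- ss[1] / sum(ss)

partial_eta_location
eta_location
```

Partial eta squared can never be smaller than eta squared, because its divisor leaves out a non-negative term.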

Now, we’ll include the interaction, which we represent by multiplying location and age together, using an asterisk as location*age.

The only significant effect now is the interaction (shown by ‘location:age’ in the output). (The location variable is significant only at the 0.1 or 10% level.)

The partial eta squared for the interaction is:

These all match the values given in the chapter.
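As a closing illustration, here is a minimal sketch of the model with the interaction, again on simulated data (hypothetical values, with an assumed balanced design of 10 people per cell). In an R formula, `location * age` is shorthand for `location + age + location:age`, which is why the output contains a ‘location:age’ row.

```r
# Simulated stand-in with a built-in interaction (hypothetical values)
set.seed(9)
people <- data.frame(
  person   = rep(1:20, times = 2),
  location = rep(c("West", "East"), each = 20),
  age      = rep(rep(c("Old", "Young"), each = 10), times = 2),
  height   = rnorm(40, mean = 170, sd = 8)
)
# Make the West/East difference appear only among the young
people$height <- people$height +
  ifelse(people$location == "West" & people$age == "Young", 8, 0)

# 2-way ANOVA with the interaction
model       <- aov(height ~ location * age, data = people)
anova_table <- summary(model)[[1]]
anova_table

ss <- anova_table[["Sum Sq"]]   # order: location, age, location:age, residuals

# Partial eta squared for the interaction
partial_eta_interaction <- ss[3] / (ss[3] + ss[4])
partial_eta_interaction
```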