Chapter 9: Correlation

Author

Colin Foster

Welcome to the online content for Chapter 9!

As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.

As always, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them. And be careful to run the boxes in order, since later boxes may depend on what earlier boxes have done.

Drawing regression lines

Let’s begin by reading in the data set that I used for this chapter:
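The interactive code boxes don’t carry over to this text, so here is a sketch of this step in plain R. The book’s data file isn’t included, so the sketch builds an invented data frame of the same shape (20 people, with height in cm and location in km); the numbers are made up and won’t match the output quoted below.

```r
# Sketch only: invented data standing in for the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
people   <- data.frame(height, location)
nrow(people)   # 20 rows, one per person
```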

We can see that there are 20 people in this data set, because there are 20 rows in the data frame.

Let’s use plot to see what the data look like:
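A sketch of that plot call, again using invented stand-in data (the book’s data set isn’t reproduced here, so the scatter won’t look like the one in the text):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
plot(location, height, xlab = "location (km)", ylab = "height (cm)")
```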

We saw in Chapter 8 how to use the function lm to find the regression line of height on location:

Here, I’ve used the lm function to do the regression, and then stored the result in an object that I’ve decided to call ‘model’.
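That step looks something like this in plain R (invented data again, so the coefficients won’t match the book’s):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
model <- lm(height ~ location)   # regress height on location
coef(model)                      # the intercept and the slope
```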

We can add the regression line to the plot by using the abline command, which draws a straight line.

If we feel artistic, we can make the regression line any colour of our choice, by setting the col parameter:
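A sketch of the plot with the line added, using purple (the colour the text uses for this line) and invented stand-in data:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
plot(location, height, xlab = "location (km)", ylab = "height (cm)")
model <- lm(height ~ location)
abline(model, col = "purple")   # draw the regression line in purple
```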

This is a useful thing to do, because I want to draw a different regression line next, and not muddle up the two.

Reversing the variables

We saw in the chapter that predicting height from location (height ~ location) is a different problem from predicting location from height (location ~ height). Let’s run the location ~ height regression:
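In plain R that’s the same call with the formula reversed. The data here are invented (the object name `model2` is mine, just to keep it separate from the first regression), so these coefficients won’t be the ones discussed below:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
model2 <- lm(location ~ height)  # now predict location from height
coef(model2)                     # a different intercept and slope
</imports>
```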

We can see that we get completely different values. This time the intercept (-145.5322) tells us the location that we’d predict for someone with zero height, which isn’t a meaningful concept, because no one can have zero height. The slope (0.8883) tells us by how much, on average, the model predicts each person’s location (in km) to increase when their height goes up by 1 cm.

If we replace height ~ location with location ~ height, we can display the regression line of location on height. I’ve put "orange" for the colour, so we don’t confuse it with the purple line above.
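A sketch of that plot, with invented stand-in data and the orange line described above:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
plot(height, location, xlab = "height (cm)", ylab = "location (km)")
abline(lm(location ~ height), col = "orange")  # regression of location on height
```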

Notice that height is going horizontally now, and location is vertical. It’s just conventional to put the variable that’s being predicted on the vertical axis.

Standardising the variables

We saw in the chapter how these two regression lines coincide if we standardise the variables. To standardise each variable, we find out how far each person’s value is from the mean, and express this in terms of standard deviations.

Let’s see how each person’s height differs from the mean height:

These are the residuals that I normally draw in blue in the book.

To express each of these residuals as a number of standard deviations, we divide by the standard deviation:
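These two steps look like this in plain R (with invented data, since the book’s data set isn’t included here):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
height - mean(height)                 # each person's distance from the mean, in cm
(height - mean(height)) / sd(height)  # the same distances, in standard deviations
```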

Let’s call this new variable ‘standardised.height’.

Remember that the <- part in the code assigns the value of the expression on the right to the variable on the left.

We can do the same thing to create ‘standardised.location’, and we can plot them against each other.
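Putting those pieces together, a sketch of the standardising and plotting (invented data again, so the picture won’t match the book’s):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
standardised.height   <- (height - mean(height)) / sd(height)
standardised.location <- (location - mean(location)) / sd(location)
plot(standardised.location, standardised.height)
```

After standardising, each variable has mean 0 and standard deviation 1, which is what makes the two regression lines coincide.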

If we use lm on these variables, the regression coefficient that we get will be equal to the correlation coefficient \(r\).

In the output, the slope here is given as 6.783e-01, which is just standard form notation for 0.6783. So, \(r = .68\), correct to 2 decimal places.
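With invented data the slope won’t be 0.6783, but the key fact still holds: the slope of the standardised regression equals \(r\). A sketch:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
standardised.height   <- (height - mean(height)) / sd(height)
standardised.location <- (location - mean(location)) / sd(location)
model <- lm(standardised.height ~ standardised.location)
unname(coef(model)[2])   # the slope of the standardised regression...
cor(height, location)    # ...equals the correlation coefficient r
```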

Let’s check that we get the same result doing the regression the other way round:

The intercept comes out different, of course, but the slope for this regression is exactly the same, 6.783e-01, or about .68. The \(p\) value for the slope is also the same in both cases (.001).
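We can confirm the symmetry directly in R. With the invented stand-in data used here, the two slopes are still identical, because both equal \(r\):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
standardised.height   <- (height - mean(height)) / sd(height)
standardised.location <- (location - mean(location)) / sd(location)
slope1 <- unname(coef(lm(standardised.height ~ standardised.location))[2])
slope2 <- unname(coef(lm(standardised.location ~ standardised.height))[2])
c(slope1, slope2)   # the same slope both ways round
```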

Correlation

The coding above was just to reproduce the argument given in the text. In reality, we can obtain correlation coefficients much more easily than this by using the cor function on our original variables ‘height’ and ‘location’:
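The cor call is a one-liner. With invented stand-in data (the book’s data set isn’t included here), the value won’t be 0.6783, but the usage is the same:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
cor(height, location)   # the correlation coefficient r, in one step
```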

We get exactly the same value of 0.6783… as we did above.

Because correlation is symmetrical, we must get the same result if we list the variables in the opposite order:
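A quick sketch of the reversed call (invented data again):

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
cor(location, height)   # the same value as cor(height, location)
```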

We could have used standardised variables, and we’d have got exactly the same answer, but there’s no need to standardise the variables if you want to find a correlation.

We can square our \(r\) value to find the percentage of shared variance between the two variables:
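In R that’s just cor squared. With the invented stand-in data used in these sketches the value won’t be 46%, but the calculation is the same:

```r
# Invented data in place of the book's data set.
set.seed(42)
location <- runif(20, min = 0, max = 800)              # km (made up)
height   <- 160 + 0.02 * location + rnorm(20, sd = 6)  # cm (made up)
cor(height, location)^2   # proportion of shared variance (multiply by 100 for %)
```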

The output tells us that 46% of the variance in these variables is shared. Another way to say this is that location accounts for 46% of the variance in height and height accounts for 46% of the variance in location.