Chapter 10: More Regression
Welcome to the online content for Chapter 10!
As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.
As always, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them, and be careful to run the boxes in order, since later boxes may depend on things you’ve done in earlier ones.
Multiple regression
Running a multiple regression in R is no more difficult than doing the simple linear regressions that we saw in the previous chapters.
Let’s begin by reading in the data set that I used near the beginning of this chapter to illustrate having more than one predictor:
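In your own R session, reading the data in might look something like the following (the file name is just illustrative; what matters is that the data frame ends up being called people, as in the code below):

```r
# Read the data into a data frame called 'people' (file name is illustrative)
people <- read.csv("people.csv")

# Have a look at the data
people
```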
We can see that there are 20 people in this data set, and we have three variables containing data on each person, in addition to the ‘person number’ variable. Height is the outcome that we want to predict, and location and age are our two predictors.
In Chapter 8, we used the formula lm(height ~ location, data=people) to do a simple linear regression. Let’s repeat that on this new data set, and use summary(…) to get more details.
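Something along these lines should do it (the object name model1 is just a convenient label):

```r
# Simple linear regression: predict height from location
model1 <- lm(height ~ location, data = people)

# Detailed output: coefficients, t tests, R-squared and the overall F test
summary(model1)
```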
We obtain an intercept (168.1691 cm), which is the height that the model predicts for someone at location 0 km. More interesting is the regression coefficient for location, which is 0.4042 cm per km, telling us that for every extra km to the East the person is predicted to be 0.4042 cm taller.
However, if we look along to the final column, we see that this predictor is not statistically significant (\(p = .527\)). The same \(p\) value appears at the very bottom, in an equivalent test of the entire model. The \(R^2\) is only 0.02256, meaning the model accounts for only about 2% of the variance. The adjusted \(R^2\) is actually negative! So really this model with just a single predictor (location) isn’t doing anything useful.
The question is, “Can we improve the model by adding age as a second predictor?” Let’s try:
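In code, it’s just a matter of adding the extra predictor to the formula, something like this (again, model2 is just a label):

```r
# Multiple regression: predict height from both location and age
model2 <- lm(height ~ location + age, data = people)

# Print the fitted model to see the intercept and the two slopes
model2
```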
The equation has changed now from height ~ location to height ~ location + age, indicating that we’re now predicting height from both location and age.
This time, we obtain an intercept (112.6993 cm), which is the height that the model predicts for someone at location 0 km and with an age of zero. An age of zero doesn’t make much sense, and, as is often the case, the intercept isn’t really of any interest.
The other two values in the output are the slopes for each of the predictors, and these regression coefficients capture the unique effect of each predictor, in the presence of the other one. So, if someone’s location is 1 km greater than someone else’s, but their ages are the same, then the model predicts their height to be 0.3177 cm greater. Similarly, if someone’s age is 1 year greater than someone else’s, but their locations are the same, then the model predicts their height to be 1.3941 cm greater. That’s how we interpret regression coefficients in a multiple regression. They’re telling us about the unique effect of that predictor, keeping all the other predictors constant.
We can get more information by using the summary(…) function:
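Assuming the model was stored as model2, as above:

```r
# More detail: t tests for each coefficient, R-squared, adjusted R-squared, F test
summary(model2)
```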
We get the same intercept and slope values in the ‘Estimate’ column, but we also get \(t\) tests telling us whether these parameters are significantly different from zero or not. The intercept is, but we don’t care about that. Location isn’t significant, but age is (\(p = .004\)). So, we conclude that age is a significant predictor of height, in the presence of location.
Multiple \(R^2\) is .4019, or about 40%. It’s probably safer to quote the Adjusted \(R^2\) value, which is 33.1%. Either way, this tells us that our model accounts for a lot of the variance in height. The \(F\) test at the bottom of the output tells us that the model as a whole fits significantly (\(p = .013\)).
Logistic regression
Let’s read in the data set that I used for logistic regression:
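Again, something like the following, with both the file name and the data frame name people2 being illustrative:

```r
# Read the logistic regression data (file and object names are illustrative)
people2 <- read.csv("people2.csv")

# Have a look at the data
people2
```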
We can see that there are 20 people in this data set, and we have two variables: height and location. But this time location is coded as 0 or 1, with 0 as South and 1 as North.
Running a logistic regression is a little more complicated. We use glm (generalised linear model), rather than lm, and we need a family parameter to say that we’re using logits. The complete code is:
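A sketch of that code, assuming the data frame is called people2 as above:

```r
# Logistic regression: predict location (0 = South, 1 = North) from height.
# family = binomial gives a logistic model (the logit link is the default).
model3 <- glm(location ~ height, family = binomial, data = people2)

# Print the model: coefficients, null deviance and residual deviance
model3
```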
The important things here are the height regression coefficient, 0.3761, the null (i.e. maximum) deviance, 27.73, and the residual deviance (of our model), 17.03. We’ll see in Chapter 11 that this is a very big decrease in deviance, so with experience we can tell that it’s going to be significant. We’ll see in Chapter 11 how to get a \(p\) value for this by using the chi-squared distribution.
For now, let’s use the summary(…) function to find out if the regression coefficient is statistically significant:
We get a \(p\) value of .015 for the height coefficient, the log odds of 0.3761 that we referred to in the chapter.
To convert the log odds of 0.3761 into an odds ratio requires a bit of mathematics. We have to use the exponential function exp to ‘undo’ the log part:
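For example (using the model3 object from above):

```r
# Convert the log odds to an odds ratio
exp(0.3761)

# Or, equivalently, take the coefficient straight from the fitted model
exp(coef(model3)["height"])
```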
This tells us that the odds ratio is 1.456593, meaning that for every extra 1 cm of height, the odds of being from the North increase by a factor of 1.456593, which is a 45.7% increase.
ANCOVA
Finally, let’s read in the data set that I used for ANCOVA:
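As before, reading it in might look something like this (file and object names are illustrative):

```r
# Read the ANCOVA data (file and object names are illustrative)
people3 <- read.csv("people3.csv")

# Have a look at the data
people3
```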
We can run an ANOVA on this data, but because age is a continuous variable, R will know that it needs to do an ANCOVA.
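A sketch of the analysis, assuming the outcome is again height and the data frame is called people3 as above:

```r
# ANOVA with the continuous covariate (age) entered first: an ANCOVA.
# location should be a factor; convert it with factor() if it is coded numerically.
ancova_model <- aov(height ~ age + location, data = people3)

# ANOVA table: age on the first line, location (adjusted for age) on the second
summary(ancova_model)
```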
The effect of age is taken out first, which is why it appears on the first line of the output. On the second line, we see the effect of location once age has been accounted for, and the \(p\) value here is .031, as given in the chapter.
Before running an ANCOVA it’s important to check the data to ensure that all of the conditions for an ANCOVA are satisfied. I haven’t discussed the details of this in the book.