Statistics for Laboratory Scientists ( 140.615 )

Multiple Linear Regression - Advanced

We are using some data stored in the R package SPH.140.615.

library(SPH.140.615)

Example: detecting a treatment effect

Imagine we have three different treatments, and we are interested whether treatment has an effect on a response. In this example, the response might also depend on age.

head(dat.anc1)
##       rsp trt age
## 1 101.747   A  36
## 2  96.697   A  48
## 3  96.557   A  48
## 4  98.311   A  42
## 5 104.293   A  30
## 6  98.865   A  43
summary(dat.anc1)
##       rsp         trt         age       
##  Min.   : 95.90   A:12   Min.   :27.00  
##  1st Qu.: 98.37   B:12   1st Qu.:37.00  
##  Median :100.42   C:12   Median :42.00  
##  Mean   :100.56          Mean   :40.53  
##  3rd Qu.:102.53          3rd Qu.:44.25  
##  Max.   :107.69          Max.   :49.00

Plot the response across the treatment levels.

boxplot(split(dat.anc1$rsp,dat.anc1$trt), col=c("red","blue","green3"), ylab="response")

The average response in treatment C seems to be a bit higher than in the two other groups. Let’s carry out an ANOVA.

aov.fit <- aov(rsp ~ trt, data=dat.anc1)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## trt          2  44.01  22.007   2.958 0.0658 .
## Residuals   33 245.54   7.441                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are not significant. However, age is strongly related to the response, but not to treatment.

plot(dat.anc1$age, dat.anc1$rsp, pch=21, bg=c("red","blue","green3")[dat.anc1$trt], xlab="age", ylab="response")

boxplot(split(dat.anc1$age,dat.anc1$trt), col=c("red","blue","green3"), ylab="age")

Let’s look at the residuals after regressing out age.

dat.anc1$res <- residuals(lm(rsp ~ age, data=dat.anc1))
boxplot(split(dat.anc1$res,dat.anc1$trt), col=c("red","blue","green3"), ylab="residuals")

The differences in means are now much more pronounced. Let’s carry out an ANOVA on the residuals.

aov.fit <- aov(res ~ trt, data=dat.anc1)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## trt          2  11.12   5.558   5.631 0.00787 **
## Residuals   33  32.58   0.987                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are very significant. Let’s analyze the data in one wash-up using an ANCOVA.

ancova.fit <- lm(rsp ~ trt + age, data=dat.anc1)
summary(ancova.fit)
## 
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7766 -0.6818  0.1645  0.4679  2.1091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.49624    1.41098  84.690  < 2e-16 ***
## trtB          0.67656    0.40923   1.653  0.10806    
## trtC          1.42126    0.41768   3.403  0.00181 ** 
## age          -0.48461    0.03321 -14.591 1.08e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 32 degrees of freedom
## Multiple R-squared:  0.8892, Adjusted R-squared:  0.8788 
## F-statistic:  85.6 on 3 and 32 DF,  p-value: 2.255e-15
ancova.fit.red <- lm(rsp ~ age, data=dat.anc1)
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
## 
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     34 43.693                                
## 2     32 32.082  2    11.611 5.7907 0.007138 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary: age explained a lot of variability in the response, so by including age in the linear model the residual variability was much smaller, and we had more power to detect the treatment effect.

Example: confounding

Same context as above.

head(dat.anc2)
##       rsp trt age
## 1 105.138   A  30
## 2 104.881   A  30
## 3 107.198   A  26
## 4 100.931   A  36
## 5 102.197   A  34
## 6 104.886   A  28
summary(dat.anc2)
##       rsp         trt         age       
##  Min.   : 87.91   A:12   Min.   :26.00  
##  1st Qu.: 96.31   B:12   1st Qu.:31.75  
##  Median :100.86   C:12   Median :37.00  
##  Mean   : 99.64          Mean   :40.42  
##  3rd Qu.:103.76          3rd Qu.:50.00  
##  Max.   :108.08          Max.   :62.00

Plot the response across the treatment levels.

boxplot(split(dat.anc2$rsp,dat.anc2$trt), col=c("red","blue","green3"), ylab="response")

Looks like the response is vastly different across treatment groups. Let’s carry out an ANOVA.

aov.fit <- aov(rsp ~ trt, data=dat.anc2)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## trt          2  701.1   350.6   39.97 1.53e-09 ***
## Residuals   33  289.4     8.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are highly significant. However, age is again strongly related to the response, and in this example, age is also related to treatment.

plot(dat.anc2$age, dat.anc2$rsp, pch=21 ,bg=c("red","blue","green3")[dat.anc2$trt], xlab="age" ,ylab="response")

boxplot(split(dat.anc2$age,dat.anc2$trt), col=c("red","blue","green3"), ylab="age")

Let’s look at the residuals after regressing out age.

dat.anc2$res <- residuals(lm(rsp ~ age, data=dat.anc2))
boxplot(split(dat.anc2$res,dat.anc2$trt), col=c("red","blue","green3"), ylab="residuals")

The means in the treatment groups now look about the same. Let’s carry out an ANOVA on the residuals.

aov.fit <- aov(res ~ trt, data=dat.anc2)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value Pr(>F)
## trt          2   0.97  0.4858   0.475  0.626
## Residuals   33  33.78  1.0238

The differences in means are not significant. Let’s analyze the data in one wash-up using an ANCOVA.

ancova.fit <- lm(rsp ~ trt + age, data=dat.anc2)
summary(ancova.fit)
## 
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3195 -0.8468 -0.1598  0.7004  2.5235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120.25394    1.04030 115.596   <2e-16 ***
## trtB         -0.38867    0.49645  -0.783    0.439    
## trtC         -0.09653    0.80367  -0.120    0.905    
## age          -0.50598    0.03251 -15.562   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.027 on 32 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9627 
## F-statistic: 302.1 on 3 and 32 DF,  p-value: < 2.2e-16
ancova.fit.red <- lm(rsp ~ age, data=dat.anc2)
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
## 
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     34 34.757                           
## 2     32 33.783  2   0.97404 0.4613 0.6346

Summary: age explained a lot of variability in the response, but also differed across treatment groups. When we include age in the linear model, treatment is not significant. The differences in the mean responses across the treatment groups we observed initially was solely due to age being associated with both treatment and response (e.g., age is a confounder).

In linear models over-fitting does not cause bias in your predictor of interest (the standard error can be slightly inflated though), but under-fitting can cause bias. So in general, it is better to over-fit than to under-fit, especially when you have a lot of data.