```{r setup, include=FALSE} knitr::opts_chunk$set(echo=TRUE, fig.align="center") ``` #### Statistics for Laboratory Scientists ( 140.615 ) ## Multiple Linear Regression #### Example from class: two hemes Here are the data. ```{r} h2o2 <- c(0,0,0, 10,10,10, 25,25,25, 50,50,50) pf3d7 <- c(0.3399,0.3563,0.3538, 0.3168,0.3054,0.3174, 0.2460,0.2618,0.2848, 0.1535,0.1613,0.1525) pyoel <- c(0.3332,0.3414,0.3299, 0.2940,0.2948,0.2903, 0.2089,0.2189,0.2102, 0.1006,0.1031,0.1452) mydat <- data.frame(y = c(pf3d7,pyoel), x1 = rep(h2o2,2), x2 = rep(0:1,rep(length(h2o2),2))) mydat ``` Fit two separate linear regressions. ```{r} lm.outA <- lm(y ~ x1, data=mydat, subset=(x2==0)) lm.outB <- lm(y ~ x1, data=mydat, subset=(x2==1)) summary(lm.outA) summary(lm.outB) ``` Plot the data, and add the two regression lines. ```{r} par(las=1) plot(y ~ x1, data=mydat, type="n", xlab="H2O2 concentration", ylab="OD") points(y ~ x1, data=mydat, subset=(x2==0), col="blue", lwd=2) points(y ~ x1, data=mydat, subset=(x2==1), col="green", lwd=2) abline(lm.outA$coef, col="blue", lty=2, lwd=2) abline(lm.outB$coef, col="green", lty=2, lwd=2) ``` Fit the full model with the heme species / concentration interaction. ```{r} lm.out <- lm(y ~ x1 * x2, data=mydat) summary(lm.out) ``` Fit a reduced model, assuming that the two hemes have the same line. ```{r} lm.red <- lm(y ~ x1, data=mydat) summary(lm.red) ``` Compare the full and reduced models. ```{r} anova(lm.red, lm.out) ``` Fit a reduced model, assuming that the two hemes have the same slope. ```{r} lm.red <- lm(y ~ x1 + x2, data=mydat) summary(lm.red) ``` Compare the full and reduced models. ```{r} anova(lm.red, lm.out) ``` Note: the p-value is the same for the interaction term in the full model. The model parameters. ```{r} params <- lm.red$coef params ``` The intercept and slope of the first heme. ```{r} params[1:2] ``` The intercept of the second heme. ```{r} params[1]+params[3] ``` The intercept and slope of the second heme. ```{r} c(params[1]+params[3], params[2]) ``` Plot the data, and add the two regression lines. ```{r} par(las=1) plot(y ~ x1, data=mydat, type="n", xlab="H2O2 concentration", ylab="OD") points(y ~ x1, data=mydat, subset=(x2==0), col="blue", lwd=2) points(y ~ x1, data=mydat, subset=(x2==1), col="green", lwd=2) abline(params[1:2], col="blue", lty=2, lwd=2) abline(c(params[1]+params[3], params[2]), col="green", lty=2, lwd=2) ``` #### Another example: a 2^2 factorial design There are two data sets in the package SPH.140.615, with blood pressure in mice as outcome, and two factors (two levels each) as predictors: water (plan and salted) and diet (normal and high fat). Read the data, and convert the predictors to factors. ```{r} library(SPH.140.615) summary(dat.bp1) summary(dat.bp2) ``` Make boxplots of the data. ```{r} par(mfrow=c(1,2), las=2, mar=c(8,5,4,2)) boxplot(split(dat.bp1$bp, list(dat.bp1$diet,dat.bp1$water)), ylab="blood pressure") boxplot(split(dat.bp2$bp, list(dat.bp2$diet,dat.bp2$water)), ylab="blood pressure") ``` In both data sets, it looks like blood pressure is higher on average in the high fat diet arm, and it looks like blood pressure is also higher on average in the salted water arm. It also looks like there is a positive interaction between the two factors in the second data set. Let's see what the linear models say. ##### Data set 1 ```{r} lm.fit <- lm(bp ~ water * diet, data=dat.bp1) summary(lm.fit) ``` For the first data set, the interaction is not significant, so let's re-fit the linear model without it. ```{r} lm.fit <- lm(bp ~ water + diet, data=dat.bp1) summary(lm.fit) ``` Both factors are highly significant! Here is the interpretation of the parameter estimates. ```{r} params <- summary(lm.fit)$coef[,1] params ``` Comparing two mice who receive the same diet, one drinking salted water, one drinking plain water, we expect the blood pressure to be ```{r} params[2] ``` units higher for the mouse drinking salted water. Comparing two mice who drink the same type of water, one eating the high fat and the other the normal diet, we expect the blood pressure to be ```{r} params[3] ``` units higher for the mouse eating the high fat diet. ##### Data set 2 ```{r} lm.fit <- lm(bp ~ water * diet, data=dat.bp2) summary(lm.fit) ``` For the second data set, the interaction is significant, which makes the interpretation of the parameter estimates a bit trickier. ```{r} params <- summary(lm.fit)$coef[,1] params ``` Comparing two mice who receive the normal diet, one drinking salted water, one drinking plain water, we expect the blood pressure to be ```{r} params[2] ``` units higher for the mouse drinking salted water. Comparing two mice who receive the high fat diet, one drinking salted water, one drinking plain water, we expect the blood pressure to be ```{r} params[2]+params[4] ``` units higher for the mouse drinking salted water. Comparing two mice who drink plain water, one eating the high fat diet, one eating the normal diet, we expect the blood pressure to be ```{r} params[3] ``` units higher for the mouse eating the high fat diet. Comparing two mice who drink salted water, one eating the high fat diet, one eating the normal diet, we expect the blood pressure to be ```{r} params[3]+params[4] ``` units higher for the mouse eating the high fat diet. ##### An ANOVA is a linear model with factors! Data set 1, as a 2-way ANOVA with interaction. ```{r} aov.fit <- aov(bp ~ water * diet, data=dat.bp1) summary(aov.fit) ``` Data set 1, as a 2-way ANOVA without interaction. ```{r} aov.fit <- aov(bp ~ water + diet, data=dat.bp1) summary(aov.fit) ``` Data set 2, as a 2-way ANOVA with interaction. ```{r} aov.fit <- aov(bp ~ water * diet, data=dat.bp2) summary(aov.fit) ``` #### Example from class: flies, density, strain, no replicates The data. ```{r} flies <- data.frame(rsp = c( 9.6, 9.3, 9.3, 10.6, 9.1, 9.2, 9.8, 9.3, 9.5, 10.7, 9.1,10.0, 11.1,11.1,10.4, 10.9,11.8,10.8, 12.8,10.6,10.7), strain = factor(rep(c("OL","BELL","bwb"),7)), density = factor(rep(c(60,80,160,320,640,1280,2560),rep(3,7)))) summary(flies) ``` Plot the data. ```{r} par(las=1) interaction.plot(flies$density, flies$strain, flies$rsp, lwd=2, col=c("blue","red","green3"), lty=1, type="b", xlab="density", ylab="response") ``` We don't have replicates in the cells, so for the two-way ANOVA we had to assume there is no interaction (otherwise we didn't have any degrees of freedom to fit an error). ```{r} flies.aov <- aov(rsp ~ strain + density, data=flies) anova(flies.aov) ``` In such an analysis, the levels of a factor do not have any particular ordering. We simply ask whether there are differences in means between the levels. However, here it looks like there is an ordering between the levels of the factor 'density', higher densities seem to lead to a longer development period. So maybe we should look at the densities as numeric quantities rather than levels of an (unordered) factor. The strain is still a factor, and our primary question might still be about differences between strains. Such an analysis, with a factor and a numeric predictor, is called an Analysis of Covariance (ANCOVA). So let's record the densities as numeric quantities instead. ```{r} flies <- data.frame(rsp = c( 9.6, 9.3, 9.3, 10.6, 9.1, 9.2, 9.8, 9.3, 9.5, 10.7, 9.1,10.0, 11.1,11.1,10.4, 10.9,11.8,10.8, 12.8,10.6,10.7), strain = factor(rep(c("OL","BELL","bwb"),7)), density = rep(c(60,80,160,320,640,1280,2560),rep(3,7))) summary(flies) ``` Plot the data. ```{r} plot(flies$density, flies$rsp, type="n", xlab="density", ylab="response") tmp <- subset(flies, strain=="OL") lines(tmp$density, tmp$rsp, col="green", type="b") tmp <- subset(flies, strain=="BELL") lines(tmp$density, tmp$rsp, col="blue", type="b") tmp <- subset(flies, strain=="bwb") lines(tmp$density, tmp$rsp, col="red", type="b") ``` The relationship between density and the response does not look linear. Let's try a logarithmic transformation of the densities. Note that the density levels typically double in the number of flies, so a logarithmic transformation makes sense. ```{r} plot(flies$density, flies$rsp,type="n", log="x") tmp <- subset(flies, strain=="OL") lines(tmp$density, tmp$rsp, col="green", type="b") tmp <- subset(flies, strain=="BELL") lines(tmp$density, tmp$rsp, col="blue", type="b") tmp <- subset(flies, strain=="bwb") lines(tmp$density, tmp$rsp, col="red", type="b") ``` That looks better. Let's fit the linear model with factor 'strain' and log2 'density' as a numeric predictor. ```{r} ancova.fit <- lm(rsp ~ strain + log2(density), data=flies) summary(ancova.fit) ``` We are testing if the factor 'strain' is significant. ```{r} ancova.fit.red <- lm(rsp ~ log2(density), data=flies) summary(ancova.fit.red) anova(ancova.fit.red, ancova.fit) ``` The full linear model indicates a significant difference between BELL (baseline) and OL, but not between BELL and bwb. Let's fit a model with strain being OL versus BELL/bwb. ```{r} lm.fit <- lm(rsp ~ ifelse(strain=="OL",1,0) + log2(density), data=flies) summary(lm.fit) ``` Here is the interpretation of our model. ```{r} params <- summary(lm.fit)$coef[,1] params ``` Comparing strains of flies at any density, we expect the OL flies to take ```{r} params[2] ``` days longer to develop than either the BELL or bwb flies. Comparing flies of the same strain at two densities, one twice as large as the other, we expect the flies to take ```{r} params[3] ``` days longer to develop in the more crowded environment.