This guide shows how to use R for analyzing cardiovascular proteomics data derived from mass spectrometry plattforms TMT or iTRAQ. This analysis pipeline contains code for data preprocessing, data normalization, and performing a two sample comparison using ordinary and moderated t-test statistics.
We provide a straightforward analysis pipeline for 10-plex TMT experiments. Hereby, we start with R-code for one experiment followed by R-code for statistical inference when multiple experiments are present.
Given the data of one 10-plex TMT experiment, we need to run the following three main functions to obtain a list of proteins where changes in protein abundance between, e.g., treated and untreated cells are calculated. Details are provided in the following paragraphs:
read.peptides
quantify.proteins
eb.fit
.
The reader is expected to have basic R knowledge. For this analysis we used the (current) Bioconductor release (version 3.1) and R version 3.2.1.
Please install R from CRAN and the basic Bioconductor packages by typing the following code in your R command window:
source("http://bioconductor.org/biocLite.R")
biocLite()
In a next step, install additional Bioconductor packages that are necessary for the proteomic data analysis. Please make sure to install and update all dependencies when you are asked.
biocLite("limma")
biocLite("qvalue")
You can also check the Bioconductor installation page.
Load the following packages by typing in your R command window
library(limma)
library(qvalue)
In a first step, we read raw data from a csv-file and store the data set in a data frame that is named dat
. The commands dim
and str
give an overview of the dimensions and the structure of the data. This exemplary data set contains simulated data for one 10-plex TMT experiment.
# read simulated TMT data set
dat <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment.csv")
dim(dat)
## [1] 22624 16
str(dat)
## 'data.frame': 22624 obs. of 16 variables:
## $ Protein.Group.Accessions: int 4506975 4506975 18079216 12707580 16579885 16579885 16579885 16579885 16579885 117938274 ...
## $ Protein.Descriptions : Factor w/ 3895 levels "1-acyl-sn-glycerol-3-phosphate acyltransferase delta [Homo sapiens]",..: 3195 3195 633 2259 155 155 155 155 155 2034 ...
## $ Sequence : Factor w/ 10980 levels "aAAAAAAAAAAAAAAGAGAGAk",..: 1 1 2 3 4 5 5 5 5 6 ...
## $ Quan.Usage : Factor w/ 1 level "Used": 1 1 1 1 1 1 1 1 1 1 ...
## $ Quan.Info : Factor w/ 1 level "Unique": 1 1 1 1 1 1 1 1 1 1 ...
## $ Isolation.Interference : int 25 38 75 31 22 53 8 10 11 69 ...
## $ X126 : num 6712160 1873633 593284 96868 4322936 ...
## $ X127_N : num 221073 392450 4248749 531059 312364 ...
## $ X127_C : num 400791 612276 1302342 731533 592202 ...
## $ X128_N : num 580248 834695 5276243 226954 777629 ...
## $ X128_C : num 2089265 1548747 432149 241949 886331 ...
## $ X129_N : num 557290 355312 1917577 1688991 6639442 ...
## $ X129_C : num 391106 20691 69547 1192171 1371124 ...
## $ X130_N : num 1314488 433218 826219 653767 684538 ...
## $ X130_C : num 1016075 2329839 3890083 208666 1504843 ...
## $ X131 : num 227170 442554 366487 803876 1755958 ...
To perform the analysis correctly the following columns are required to be included in the data set and should be named in the same way as in this exemplary data set. Each row corresponds to a peptide spectra
Protein.Group.Accessions |
protein group accession or a protein id for the spectra |
Sequence |
name of peptide sequence |
Quan.Usage |
indicator whether the spectra is used or not used |
Quan.Info |
indicator whether the spectra is unique or redundant |
Isolation.Interference |
% of isolation interference |
The column Protein.Descriptions
is optional and maybe helpful for deeper investigation after detecting significantly differentially expressed proteins.
For each row, the columns X126, X127_N, ..., X131
contain measurements for the 10 channels.
In the next step we define a vector of headlines of all channels as they are stored in the data set dat
.
cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")
R code for all function is provided in the appendix.
The function read.peptides
does data preprocessing and excludes spectra (rows in the data set) that contain missing values, a higher isolation interference than 30%, and that are redundant
or not used
. The exclusion of spectra with a higher isolation interference than 30% is a suggestion.
THe funcion quantify.proteins
quantifies all proteins that are represented by peptide spectra. Channel values for each protein are median polished log2
transformed values from the median value of all peptides belonging to each protein.
The function eb.fit
(eb: empirical Bayes) performes the statistical analysis of interest: two sample t-tests. The output of this procedure is a data frame that in particular contains for each protein its log2 (fold-change) and its ordinary as well as moderated p-values and q-values. The results are sorted by moderated p-values in increasing order.
dat <- read.peptides(dat, cha)
dim(dat) # 19047 peptides
## [1] 19047 16
# identify proteins from peptide spectra
# (takes aorund one minute with an average laptop)
dat <- quantify.proteins(dat, cha)
dim(dat) # 3569 proteins identified
## [1] 3569 12
# find "one-hit wonders"
dat.onehit <- subset(dat, dat$n.peptides == 1)
dim(dat.onehit) # 1821 proteins are identified by only one peptide
## [1] 1821 12
# eliminate "one-hit wonders"
dat <- subset(dat, dat$n.peptides != 1)
dim(dat) # 1748 proteins are identified by at least two peptides
## [1] 1748 12
# boxplot: intensities of all eight channels after data preprocessing and normalization
par(mfrow=c(1,1), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
boxplot(dat[, 1:length(cha)], ylim = c(-3, 3), main="Boxplot normalized Intensities")
# define treatment (tr) and control (ct) groups for a two group comparison, assuming 5 cases and 5 controls
tr <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C")
ct <- c("X129_N", "X129_C", "X130_N", "X130_C", "X131")
# define design according to limma package framework
design <- model.matrix(~factor(c(2,2,2,2,2,1,1,1,1,1)))
design
## (Intercept) factor(c(2, 2, 2, 2, 2, 1, 1, 1, 1, 1))2
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 1
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(c(2, 2, 2, 2, 2, 1, 1, 1, 1, 1))`
## [1] "contr.treatment"
colnames(design) <- c("Intercept", "tr-cr")
res.eb <- eb.fit(dat[, c(tr,ct)], design)
head(res.eb)
## log2FC t.ord t.mod p.ord p.mod q.ord
## 189027129 2.3440 5.026608 5.133506 0.001018458 0.0002793448 0.5819295
## 82775371 1.4416 6.226266 4.676150 0.000252119 0.0005926906 0.3435027
## 66932916 1.4338 4.002641 3.742556 0.003935244 0.0029926666 0.8222439
## 28827795 -1.2990 -4.312766 -3.738402 0.002571040 0.0030149568 0.8222439
## 63252888 1.6682 3.676763 3.725913 0.006248238 0.0030830064 0.8222439
## 226246591 1.6096 3.659504 3.674562 0.006405803 0.0033798430 0.8222439
## q.mod df.r df.0 s2.0 s2 s2.post
## 189027129 0.4821453 8 3.553636 0.4707887 0.5436323 0.5212273
## 82775371 0.5114880 8 3.553636 0.4707887 0.1340214 0.2376034
## 66932916 0.7428239 8 3.553636 0.4707887 0.3207923 0.3669278
## 28827795 0.7428239 8 3.553636 0.4707887 0.2268018 0.3018466
## 63252888 0.7428239 8 3.553636 0.4707887 0.5146416 0.5011534
## 226246591 0.7428239 8 3.553636 0.4707887 0.4836504 0.4796945
The rownames of the generated data frame res.eb
correspond to protein groups accessions numbers provided by the original raw data set. Each protein (row) has the following columns:
log2FC |
estimate of the log2-fold-change corresponding to the effect size |
t.ord |
vector of ordinary t-statistic |
t.mod |
moderated t-statistic |
p.ord |
ordinary p-value corresonding to the ordinary t-statistic |
p.mod |
moderated p-value corresonding to the moderated t-statistic |
q.ord |
ordinary q-value corresonding to the ordinary t-statistic |
q.mod |
moderated q-value corresonding to the moderated t-statistic |
df.r |
residual degrees of freedom assiciated with ordinary t-statistic and p-value |
df.0 |
degrees of freedom associated with s2.0 |
s2.0 |
estimated prior value for the variance |
s2 |
sample variance |
s2.post |
posterior value for the variance |
The protein with protein group accessions 189027129 listed on top of the results chart has an estimated log2
-effect size of 2.344 between five cases and five controls. The moderated p-value of 0.00027
is considerably smaller that the ordinary p-value of 0.001
. A moderated q-value of 0.48215 indicates that this protein cannot be declared significant if the false discovery rate (FDR) should be controlled at 5%.
The following volcano plots visualize the shrinkage effect of the moderated p-values compared to the orinary by plotting the significance versus log2-fold-change on the y- and x-axes, respectively.
# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb$log2FC))*1.1
ry <- c(0, ceiling(max(-log10(res.eb$p.ord), -log10(res.eb$p.mod))))
par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")
plot(res.eb$log2FC, -log10(res.eb$p.ord), pch=21, bg="lightgrey", cex=0.9,
xlim=rx, ylim=ry, xaxt="n",
xlab="fold change", ylab="-log10 p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")
plot(res.eb$log2FC, -log10(res.eb$p.mod), pch=21, bg="lightgrey", cex=0.9,
xlim=rx, ylim=ry, xaxt="n",
xlab="fold change", ylab="-log10 p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")
In this section, we analyze three TMT experiments together. For simplicity, we assume that each of the three data sets is already preprocessed separately by applying read.peptides
and quantify.proteins
to the original data sets. We also sugguest to remove one-hit wonders. Each entry in each data set represents a simulated measurement for one protein (rows) in a specific channel (columns).
First, we additionally load the function eb.fit.mult
which is an extension of eb.fit
when multiple experiments are present.
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.mult.r")
Next, we read the data of three TMT experiments that are stored in csv-files.
# read preprocessed data sets
dat1 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_1.csv")
dat2 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_2.csv")
dat3 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_3.csv")
# set channel names
cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")
Now, we estimate the treatment effect by adjusting the statistical model for the three different experiments.
# limma, factorial design; the first 10 columns of ech data set contain the channel measurements
# Here, dat1, dat2, and dat3 only consist of 10 channel columns, but there might me addional information saved in the data sets
dat <- data.frame(dat1[, 1:10], dat2[, 1:10], dat3[, 1:10])
tr <- as.factor(rep(c(2,2,2,2,2,1,1,1,1,1), 3))
ex <- as.factor(c(rep(1,10), rep(2,10), rep(3,10)))
design <- model.matrix(~ ex + tr)
res.eb.mult <- eb.fit.mult(dat, design)
head(res.eb.mult)
## log2FC t.ord t.mod p.ord p.mod q.ord
## 52 0.3400148 8.819983 7.800588 2.706211e-09 1.108432e-08 2.703505e-06
## 54 0.3128688 7.240442 6.621382 1.089316e-07 2.583376e-07 5.283399e-05
## 48 0.3910481 6.785453 6.607751 3.350579e-07 2.681472e-07 7.642497e-05
## 75 0.3189281 7.087028 6.550356 1.586606e-07 3.137650e-07 5.283399e-05
## 66 0.4260741 6.589929 6.533841 5.471007e-07 3.282933e-07 7.642497e-05
## 37 0.4693814 6.381745 6.431789 9.264516e-07 4.344809e-07 9.255252e-05
## q.mod df.0 df.r s2.0 s2 s2.post
## 52 1.107324e-05 3.776869 26 0.03561429 0.01114606 0.01424958
## 54 6.559300e-05 3.776869 26 0.03561429 0.01400411 0.01674512
## 48 6.559300e-05 3.776869 26 0.03561429 0.02490944 0.02626723
## 75 6.559300e-05 3.776869 26 0.03561429 0.01518863 0.01777940
## 66 6.559300e-05 3.776869 26 0.03561429 0.03135235 0.03189293
## 37 6.566976e-05 3.776869 26 0.03561429 0.04057273 0.03994381
Note, that there may be a warning message that NAs
occure. This is due to missing values in the data sets (proteins that are not present in all three data sets) and does not effect the results for statistical inference.
Create volcano plots:
# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb.mult$log2FC), na.rm = TRUE)*1.1
ry <- c(0, ceiling(max(-log10(res.eb.mult$p.ord), -log10(res.eb.mult$p.mod), na.rm = TRUE)))
par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")
plot(res.eb.mult$log2FC, -log10(res.eb.mult$p.ord), pch=21, bg="lightgrey", cex=0.9,
xlim=rx, ylim=ry, xaxt="n",
xlab="fold change", ylab="-log10 p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")
plot(res.eb.mult$log2FC, -log10(res.eb.mult$p.mod), pch=21, bg="lightgrey", cex=0.9,
xlim=rx, ylim=ry, xaxt="n",
xlab="fold change", ylab="-log10 p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")
read.peptides <- function(dat, cha){
output <- NULL
dat$Sequence <- as.character(dat$Sequence)
dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
dat$Quan.Usage <- as.character(dat$Quan.Usage)
dat$Quan.Info <- as.character(dat$Quan.Info)
dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
dat <- subset(dat, Isolation.Interference<=30)
dat <- subset(dat, Quan.Usage=="Used")
dat <- subset(dat, Protein.Group.Accessions!="")
dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))
}
quantify.proteins <- function(dat, cha){
e.function <- function(x, seq) tapply(x, seq, median)
output <- NULL
dat$Sequence <- toupper(dat$Sequence) # Capital letters
accessions <- as.character(unique(dat$Protein.Group.Accessions))
n.proteins <- length(accessions)
n.cha <- length(cha)
for(k in 1:n.proteins){
id <- accessions[k]
sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
sdat[cha] <- log2(sdat[cha])
sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
pdat <- sdat[, -1]
n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])
n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)
if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
output <- rbind(output, pdat)
}
output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
output[,1:n.cha] <- sweep(output[,1:n.cha],2,apply(output[,1:n.cha],2,median))
output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
output[,1:n.cha] <- round(output[,1:n.cha],3)
row.names(output) <- accessions
output <- as.data.frame(output)
return(output)
}
eb.fit <- function(dat, design){
n <- dim(dat)[1]
fit <- lmFit(dat, design)
fit.eb <- eBayes(fit)
log2FC <- fit.eb$coefficients[, 2]
df.r <- fit.eb$df.residual
df.0 <- rep(fit.eb$df.prior, n)
s2.0 <- rep(fit.eb$s2.prior, n)
s2 <- (fit.eb$sigma)^2
s2.post <- fit.eb$s2.post
t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
t.mod <- fit.eb$t[, 2]
p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
p.mod <- fit.eb$p.value[, 2]
q.ord <- qvalue(p.ord)$q
q.mod <- qvalue(p.mod)$q
results.eb <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
results.eb <- results.eb[order(results.eb$p.mod), ]
return(results.eb)
}
eb.fit.mult <- function(dat, design){
n <- dim(dat)[1]
fit <- lmFit(dat, design)
fit.eb <- eBayes(fit)
log2FC <- fit.eb$coef[, "tr2"]
df.0 <- rep(fit.eb$df.prior, n)
df.r <- fit.eb$df.residual
s2.0 <- rep(fit.eb$s2.prior, n)
s2 <- (fit.eb$sigma)^2
s2.post <- fit.eb$s2.post
t.ord <- fit.eb$coef[, "tr2"]/fit.eb$sigma/fit.eb$stdev.unscaled[, "tr2"]
t.mod <- fit.eb$t[, "tr2"]
p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
p.mod <- fit.eb$p.value[, "tr2"]
q.ord <- q.mod <- rep(NA,n)
ids <- which(!is.na(p.ord))
k <- 0
q1 <- qvalue(p.ord[!is.na(p.ord)])$q
q2 <- qvalue(p.mod[!is.na(p.mod)])$q
for(i in ids){
k <- k+1
q.ord[i] <- q1[k]
q.mod[i] <- q2[k]
}
results.eb.mult <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
results.eb.mult <- results.eb.mult[order(results.eb.mult$p.mod), ]
return(results.eb.mult)
}