Preprocessing and Analysis of a Single Affymetrix Microarray



Abstract

Methods for single microarray preprocessing -- frozen robust multiarray analysis (fRMA) -- and analysis -- gene expression barcode -- have recently been developed and implemented for Affymetrix's hgu133a platform. We have created a collection of user-friendly R packages that allow the user to extend these methods to any Affymetrix microarray platform and have already done so for Affymetrix's hgu133plus2 and mouse4302 platforms. Extending these methods to other manufacturers is straightforward and under development.



The vast majority of methods for the preprocessing and analysis of microarray gene expression data rely on multiple arrays. However, for microarrays to be used in a clinical setting to aid in diagnosis or treatment, one needs to be able to gain useful information from a single microarray hybridization. Specifically, one needs to be able to: (1) preprocess a single microarray, and (2) estimate the expression of each gene on the array. Recent work by McCall et al. (2010) provided a method of single array preprocessing called Frozen Robust Multiarray Analysis (fRMA). Previous work by Zilliox and Irizarry (2007) and, more recently, by McCall et al. (2009) showed how to obtain gene expression estimates from the preprocessed data. Specifically, Zilliox and Irizarry (2007) developed a method to map gene intensities into a vector of ones and zeros denoting which genes are expressed (ones) and unexpressed (zeros) in a given sample. They called this sequence of ones and zeros a gene expression barcode. McCall et al. (2009) improved upon the methods of Zilliox and Irizarry (2007). For a description of the statistical methodology underlying these methods, we refer the reader to the original papers.

In this manuscript, we focus on the computational tools created to implement these methods and their extension to additional microarray platforms. The computational tools described here are written in the open-source statistical language R and have been submitted to the Bioconductor project (Gentleman et al., 2004), a collaborative effort to produce computational tools for biological data. We have implemented the primary tools necessary to preprocess and analyze data from a single microarray hybridization in a single R package, frma. This package provides the fundamental framework for single array preprocessing and analysis; however, it requires additional platform-specific data packages to preprocess and analyze a single array -- <platform>frmavecs and <platform>barcodevecs, respectively. To extend these methods to a new platform, one need only create such a package or supply one's own data to the functions. Furthermore, the frmaTools package contains functions to aid in the creation of such data packages. Lastly, because the source code is made freely available, all of these packages are easily customizable.


Raw Data

For Affymetrix microarray data, it is customary to view CEL files as the starting point for preprocessing and analysis. One or more CEL files can be read into R using the ReadAffy function from the affy package to produce an AffyBatch object (Gautier et al., 2004). We adopt this convention and refer to the probe-level intensities from a single CEL file stored in an AffyBatch object as the raw data.


Frozen Robust Multiarray Analysis

The goal of fRMA is to obtain reliable gene-level intensities from a single microarray hybridization. This amounts to converting raw probe-level intensities into background-corrected, normalized gene-level intensities. The frma package contains a function frma which takes an AffyBatch as input and produces an object containing gene-level expression values. This object can take one of two forms, an ExpressionSet or a frmaExpressionSet, depending upon which additional information is requested. The gene-level intensities stored in both these objects can be accessed using the exprs method.

To preprocess single arrays using fRMA, one needs a number of frozen parameter vectors. Among these are the reference distribution to which the data are normalized and the probe-effect estimates. We have computed these frozen parameters for three popular Affymetrix platforms -- hgu133a, hgu133plus2, and mouse4302. The data for each of these platforms are stored in R packages of the form <platform>frmavecs. By default, the frma function attempts to load the appropriate data package for the given AffyBatch. However, it is possible for the user to supply some or all of the frozen parameter vectors. Furthermore, the frmaTools package contains functions to create the required vectors from a large AffyBatch object. This allows the user to easily produce one's own frozen parameter estimates for any platform for which a suitable amount of data is available, making the frma algorithm easily extendable and customizable.
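To make the role of the frozen parameter vectors concrete, the following is a minimal sketch, written in Python purely for illustration and not taken from the frma package. It assumes log2-scale intensities, omits background correction, and uses a plain median in place of the robust summarization described in McCall et al. (2010). The function names, the probe-set layout, and all numeric values are hypothetical.

```python
from statistics import median


def normalize_to_reference(intensities, reference):
    """Quantile-normalize one array to a frozen reference distribution.

    The probe with the k-th smallest intensity receives the k-th smallest
    reference value. Ties are broken by probe order (a simplification).
    """
    if len(intensities) != len(reference):
        raise ValueError("reference must have one value per probe")
    ref_sorted = sorted(reference)
    # Probe indices ordered from lowest to highest raw intensity.
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    normalized = [0.0] * len(intensities)
    for rank, probe_index in enumerate(order):
        normalized[probe_index] = ref_sorted[rank]
    return normalized


def summarize_probesets(normalized, probe_effects, probesets):
    """Subtract frozen probe effects, then take the median per probe set.

    The median is a stand-in (assumption) for the robust summarization
    used by the real method.
    """
    result = {}
    for gene, probe_indices in probesets.items():
        adjusted = [normalized[i] - probe_effects[i] for i in probe_indices]
        result[gene] = median(adjusted)
    return result


# Toy single array: six probes in two probe sets (made-up values).
raw = [7.9, 12.1, 8.4, 6.2, 10.3, 9.0]
frozen_reference = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
frozen_probe_effects = [0.5, -0.5, 0.0, 0.0, 1.0, -1.0]
probesets = {"geneA": [0, 1, 2], "geneB": [3, 4, 5]}

normalized = normalize_to_reference(raw, frozen_reference)
expression = summarize_probesets(normalized, frozen_probe_effects, probesets)
print(expression)  # {'geneA': 8.0, 'geneB': 9.0}
```

Because the reference distribution and probe effects are fixed in advance, nothing in this computation depends on any other array, which is the property that allows a single hybridization to be preprocessed on its own.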

The default arguments to the frma function implement the method described in McCall et al. (2010); however, there are many additional options implemented. These optional arguments to the frma function allow the user to control each stage of the preprocessing, as well as the information returned by the function. This flexibility is instrumental in allowing users to easily explore alternative methods of preprocessing.


Gene Expression Barcode

The creation of a gene expression barcode is designed to convert gene-level intensities into gene expression estimates. The algorithm originally proposed by Zilliox and Irizarry (2007) and improved by McCall et al. (2009) is also implemented in the frma package, which contains a function barcode. By default this function takes the output from frma and creates a gene expression barcode; however, there are other input and output options available to users.

As with the frma function, the barcode function requires platform-specific precomputed parameters. These parameters are stored in R packages of the form <platform>barcodevecs. In the default implementation, the barcode function attempts to load the appropriate data package for the given ExpressionSet object. frmaExpressionSet objects are converted to ExpressionSet objects when passed to the barcode function. It is also possible for the user to supply the necessary parameters via optional arguments.
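The mapping from gene-level intensities to a vector of ones and zeros can be illustrated with a short sketch, again in Python and not drawn from the frma package. It assumes, as a simplification not spelled out in the text above, that the precomputed parameters for each gene are the mean and standard deviation of its intensity when unexpressed, and that a gene is called expressed when its standardized intensity exceeds a cutoff. The function name, the cutoff of 3.0, and all numeric values are hypothetical.

```python
def barcode_sketch(expression, unexpressed_mean, unexpressed_sd, cutoff=3.0):
    """Return 1 (expressed) or 0 (unexpressed) for each gene.

    Each intensity is standardized against that gene's assumed
    unexpressed distribution; the cutoff is an illustrative choice.
    """
    calls = {}
    for gene, value in expression.items():
        z = (value - unexpressed_mean[gene]) / unexpressed_sd[gene]
        calls[gene] = 1 if z > cutoff else 0
    return calls


# Toy gene-level intensities and per-gene parameters (made-up values).
gene_intensities = {"geneA": 8.0, "geneB": 9.0, "geneC": 5.1}
unexpressed_mean = {"geneA": 5.0, "geneB": 8.5, "geneC": 5.0}
unexpressed_sd = {"geneA": 0.5, "geneB": 0.5, "geneC": 0.4}

calls = barcode_sketch(gene_intensities, unexpressed_mean, unexpressed_sd)
print(calls)  # {'geneA': 1, 'geneB': 0, 'geneC': 0}
```

Note that geneB has the highest intensity yet is called unexpressed, because the call depends on each gene's own precomputed parameters rather than on a single array-wide threshold.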


Concluding Remarks

Since the publication of the original papers (McCall et al., 2009, 2010), there have been two primary improvements: (1) user-friendly R package implementations of the fRMA and barcode algorithms have been added to the Bioconductor project, and (2) the extension of these algorithms to other platforms has been facilitated by code that is easily customizable and extendable.

Using the tools available to extend the functionality of these packages, we have implemented the fRMA and barcode algorithms on Affymetrix's hgu133plus2 and mouse4302 platforms and have greatly increased the amount of data that can be preprocessed and analyzed using our methods. There are over 35,000 hgu133plus2 arrays and over 14,000 mouse4302 arrays available via the Gene Expression Omnibus (GEO) (Edgar et al., 2002). Furthermore, by extending these methods to a mouse array, one can compare results from a common model organism to human results.

We have sought to make the frma package, including its frma and barcode functions, as customizable as possible. The motivation behind this is two-fold. First, we hope that this will encourage users to extend our methods to additional microarray platforms. Second, this allows the user to critically examine and potentially improve upon our methods. The latter is perhaps the ultimate goal of open-source software and scientific inquiry in general.

References


1. Edgar, R., Domrachev, M., and Lash, A. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), 207.
2. Gautier, L., Cope, L., Bolstad, B., and Irizarry, R. (2004). affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3).
3. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H., and Zhang, J. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, 5, R80.
4. McCall, M., Zilliox, M., and Irizarry, R. (2009). Gene Expression Barcodes Based on Data from 8,277 Microarrays. Johns Hopkins University Dept. of Biostatistics Working Papers, page 200.
5. McCall, M. N., Bolstad, B. M., and Irizarry, R. A. (2010). Frozen robust multiarray analysis (fRMA). Biostatistics, page kxp059.
6. Zilliox, M. and Irizarry, R. (2007). A gene expression bar code for microarray data. Nature Methods, 4, 911-913.