Extending SS-ANOVA Models with Arbitrary Relationship Data


 

A novel method for incorporating arbitrary pedigree data into smoothing spline ANOVA (SS-ANOVA) models.
  • H. Corrada Bravo, K.E. Lee, B.E.K. Klein, R. Klein, S.K. Iyengar and G. Wahba. "Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models." Proceedings of the National Academy of Sciences 106(20):8128-8133, May 2009. [PNAS] [pdf]

ABSTRACT Our goal is to investigate the relative importance of genetic, environmental and familial components in a flexible nonparametric predictive eye disease risk model.
By expressing pedigree data as a positive semidefinite kernel matrix, the SS-ANOVA model is able to estimate a log-odds ratio as a multicomponent function of several variables: one or more functional components representing information from environmental covariates and/or genetic marker data and another representing pedigree relationships. We propose two methods for creating positive semidefinite kernels from pedigree information, including the use of Regularized Kernel Estimation (RKE).
We present a case study on models for retinal pigmentary abnormalities in the Beaver Dam Eye Study (BDES). Our model verifies known facts about the epidemiology of this eye lesion - found in eyes with early age-related macular degeneration (AMD) - and shows significantly increased predictive ability in models that include all three of the genetic, environmental and familial data sources. The case study also shows that models that contain only two of these data sources, that is, pedigree-environmental covariates, or pedigree-genetic markers, or environmental covariates-genetic markers, have comparable predictive ability, though lower than that of the model with all three. This result is consistent with the notions that genetic marker data encodes - at least partly - pedigree data, and that familial correlations encode shared environment data as well.
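As a rough illustration of how pedigree relationships can be turned into a numeric matrix, the sketch below computes kinship coefficients with the standard recursive definition. This is an assumption chosen for illustration, not necessarily either of the two kernel constructions proposed in the paper; the function name `kinship_matrix` and the toy four-person pedigree are hypothetical.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients for a pedigree (illustrative sketch only).

    pedigree: list of (id, father_id, mother_id) tuples, with every parent
    listed before its children; use None for an unknown (founder) parent.
    Returns a symmetric list-of-lists matrix phi.
    """
    index = {}                      # id -> row/column position
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]
    for i, (ind, father, mother) in enumerate(pedigree):
        f = index.get(father)       # None when the parent is unknown
        m = index.get(mother)
        # Kinship with every earlier individual: average over the two parents.
        for j in range(i):
            value = 0.0
            if f is not None:
                value += 0.5 * phi[f][j]
            if m is not None:
                value += 0.5 * phi[m][j]
            phi[i][j] = phi[j][i] = value
        # Self-kinship: 1/2 plus half the kinship between the parents.
        parents_kin = phi[f][m] if f is not None and m is not None else 0.0
        phi[i][i] = 0.5 * (1.0 + parents_kin)
        index[ind] = i
    return phi


# Hypothetical nuclear family: two unrelated founders and two children.
family = [
    ("dad", None, None),
    ("mom", None, None),
    ("kid1", "dad", "mom"),
    ("kid2", "dad", "mom"),
]
phi = kinship_matrix(family)
```

In this toy family, parent-child and sibling pairs both receive a coefficient of 0.25, while the two founders receive 0. A matrix of this kind is one possible starting point for a relationship-based kernel.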
  • H. Corrada Bravo, G. Wahba, K.E. Lee, B.E.K. Klein, R. Klein, S.K. Iyengar. "Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models with application to ophthalmology data." University of Wisconsin-Madison Dept. of Statistics Technical Report 1148, 2008. [pdf].

 


Probabilistic models for second-generation sequencing quality assessment, base-calling and mapping


 

Simple models of intensity measurements in second-generation sequencing data give easily interpretable quality assessment metrics while capturing uncertainty in base-calling and mapping.
  • H. Corrada Bravo, R.A. Irizarry. "Model-based quality assessment and base-calling for second-generation sequencing data" (November 2009). Biometrics. Published online before print, November 13, 2009. doi:10.1111/j.1541-0420.2009.01353.x [Biometrics] [pdf] [software]

ABSTRACT Second-generation sequencing (sec-gen) technology is capable of sequencing millions of short fragments of DNA in parallel and can be used to assemble complex genomes for a fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to sequence the genomes of approximately 1,200 people. The possibility of comparative analysis at the sequence level of a large number of samples across multiple populations may be achievable within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads, which are the result of complex processing of noisy continuous fluorescence intensity data. This complex processing, known as base-calling, results in discretized sequence reads of widely varying quality. Furthermore, this variation in processing quality results in infrequent but systematic errors that we have found to be misleading in downstream analysis of the discretized data at the sequence read level. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is important. We present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, which allows for informative and easily interpretable metrics that capture the variability in sequencing quality. In contrast to other recently proposed methods for improved base-calling in the Illumina platform, our model provides informative estimates readily usable in quality assessment tools while retaining base-calling performance.
  • H. Corrada Bravo, R. Irizarry. "Model-based quality assessment and base-calling for second-generation sequencing data" (April 2009). Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 184. [local pdf] [COBRA pdf]

 


Estimating Tree-Structured Covariance Matrices via Mixed Integer Programming


 

A novel method for estimating tree-structured covariance matrices directly from observed continuous data.
  • H. Corrada Bravo, S. Wright, K. H. Eng, S. Keles and G. Wahba. "Estimating Tree-Structured Covariance Matrices via Mixed-Integer Programming." Journal of Machine Learning Research W&CP 5:33-40, 2009 (AISTATS 2009). [local pdf] [JMLR pdf].

ABSTRACT A representation of this class of matrices as linear combinations of rank-one matrices indicating object partitions is used to formulate estimation as instances of well-studied numerical optimization problems. In particular we propose projection under Frobenius or sum-absolute-value norms as estimation procedures and show that for fixed topologies these projection problems can be solved by quadratic and linear programming, respectively. Furthermore, we use linear and quadratic Mixed-Integer Programming to solve projection problems of unknown topology.
Our main application is the discovery of phylogenetic structure in gene expression data. Recent phylogenetic comparative analyses of gene expression use tree-structured covariance matrices to correct for lack of independence. Typically, a phylogenetic tree derived from DNA or amino acid sequence is used to define this covariance matrix. However, recent results on the variability of tree topology of sequence-derived trees contingent on the region of the genome used to construct the tree lead us to propose our method as an exploratory tool where tree-structured covariance matrices are estimated directly from gene expression data. This exploratory tool can guide investigators in their modeling choices for comparative analysis.
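The rank-one representation mentioned in the abstract can be illustrated with a small sketch. For a fixed tree, each branch contributes its length times the outer product of an indicator vector marking the leaves below that branch. The function name `tree_covariance`, the three-leaf topology, and the branch lengths are hypothetical choices for illustration; the sketch only builds the matrix and does not perform the projection or mixed-integer estimation described in the paper.

```python
def tree_covariance(n_leaves, branches):
    """Build a tree-structured covariance matrix (illustrative sketch only).

    branches: list of (length, leaves) pairs, where `leaves` is the set of
    leaf indices below that branch. Each branch adds length * v v^T, with v
    the 0/1 indicator vector of `leaves`.
    """
    cov = [[0.0] * n_leaves for _ in range(n_leaves)]
    for length, leaves in branches:
        for i in leaves:
            for j in leaves:
                cov[i][j] += length  # entry (i, j) of length * v v^T
    return cov


# Hypothetical topology ((0, 1), 2): leaves 0 and 1 share an internal branch.
branches = [
    (1.0, {0}),      # terminal branch to leaf 0
    (1.0, {1}),      # terminal branch to leaf 1
    (2.0, {2}),      # terminal branch to leaf 2
    (1.0, {0, 1}),   # internal branch shared by leaves 0 and 1
]
cov = tree_covariance(3, branches)
```

Here the covariance between two leaves equals the total length of the branches they share on the way from the root, so leaves 0 and 1 have covariance 1.0 and leaf 2 is uncorrelated with both. With the topology fixed, the branch lengths are the only unknowns, and they enter the matrix linearly.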
  • H. Corrada Bravo, K. H. Eng, S. Keles, G. Wahba and S. Wright. "Estimating Tree-Structured Covariance Matrices via Mixed-Integer Programming with an Application to Phylogenetic Analysis of Gene Expression." University of Wisconsin Dept. of Statistics Technical Report 1142, 2008. [pdf].
  • We use the mixed-integer solver in CPLEX in this project. We have written an R interface to CPLEX (Linux and Windows) available here.
  • K.H. Eng, H. Corrada Bravo, G. Wahba, S. Keles. "A phylogenetic mixture model for the evolution of gene expression." University of Wisconsin-Madison Dept. of Statistics Technical Report 1144, 2008. [pdf].

 


Probabilistic Inference and Aggregate Database Queries


 

This paper presents a broad class of aggregate queries, called MPF queries, inspired by the literature on probabilistic inference in statistics and machine learning.
  • H. Corrada Bravo, R. Ramakrishnan. "Optimizing MPF Queries: Decision Support and Probabilistic Inference". SIGMOD '07, June 11-14 2007, Beijing, China. [pdf] © ACM.

ABSTRACT

An MPF (Marginalize a Product Function) query is an aggregate query over a stylized join of several relations. In probabilistic inference, this join corresponds to taking the product of several probability distributions, while the aggregate operation corresponds to marginalization. Probabilistic inference can be expressed directly as MPF queries in a relational setting, and therefore, by optimizing evaluation of MPF queries, we provide scalable support for probabilistic inference in database systems. To optimize MPF queries, we build on ideas from database query optimization as well as traditional algorithms such as Variable Elimination and Belief Propagation from the probabilistic inference literature.

Although our main motivation for introducing MPF queries is to support easy expression and efficient evaluation of probabilistic inference in a DBMS, we observe that this class of queries is very useful for a range of decision support tasks. We present and optimize MPF queries in a general form where arbitrary functions (i.e., other than probability distributions) are handled, and demonstrate their value for decision support applications through a number of illustrative and natural examples. We have also extended the PostgreSQL optimizer to implement these optimization techniques.
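A minimal sketch of what an MPF-style query might look like, using Python's built-in `sqlite3` module rather than the extended optimizer described above. The table names (`p_a`, `p_b_given_a`), the column `f` holding function values, and the probabilities are all hypothetical. The join multiplies the two functions and the `SUM ... GROUP BY` marginalizes out the variable `a`.

```python
import sqlite3


def marginal_b():
    """Compute P(b) = sum_a P(a) * P(b | a) as a join plus aggregate."""
    conn = sqlite3.connect(":memory:")
    # Each relation stores a function: variable values plus a value column f.
    conn.execute("CREATE TABLE p_a (a INTEGER, f REAL)")
    conn.execute("CREATE TABLE p_b_given_a (a INTEGER, b INTEGER, f REAL)")
    conn.executemany("INSERT INTO p_a VALUES (?, ?)", [(0, 0.6), (1, 0.4)])
    conn.executemany(
        "INSERT INTO p_b_given_a VALUES (?, ?, ?)",
        [(0, 0, 0.9), (0, 1, 0.1), (1, 0, 0.2), (1, 1, 0.8)],
    )
    # Product over the join on the shared variable a, then marginalize a.
    rows = conn.execute(
        """
        SELECT t2.b, SUM(t1.f * t2.f)
        FROM p_a AS t1
        JOIN p_b_given_a AS t2 ON t1.a = t2.a
        GROUP BY t2.b
        ORDER BY t2.b
        """
    ).fetchall()
    conn.close()
    return dict(rows)


result = marginal_b()
```

The returned dictionary maps each value of `b` to its marginal probability. Replacing the probability tables with other numeric functions, or `SUM` with another aggregate, gives the kind of general decision-support query the abstract alludes to.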

  • H. Corrada Bravo, R. Ramakrishnan. "Optimizing MPF Queries: Decision Support and Probabilistic Inference". Computer Sciences TR-1567 2006.
  • H. Corrada Bravo, R. Ramakrishnan. "Optimizing MPF Query Workloads: View Materialization Techniques for Probabilistic Inference". In preparation.

 


Graph-Based Data Analysis

 

Ph.D. dissertation for the Dept. of Computer Sciences, University of Wisconsin-Madison. [pdf]

Contact:

615 N. Wolfe St. E5618
Baltimore, MD 21205
Fax: 410-955-0958
email: hcorrada@jhsph.edu