Statistics for Genomics

The course exposes students to active research in computational genomics and introduces advanced statistical methods for solving bioinformatics problems. Topics include (1) microarray analysis: normalization, preprocessing, differential gene expression, multiple hypothesis testing, false discovery rate; (2) SNP arrays: genotyping, copy number variations, genome-wide association studies; (3) tiling arrays: ChIP-chip, model-based background correction, data segmentation, hidden Markov models, hierarchical mixture models; (4) gene regulation, epigenetics and epigenomics; (5) next-generation sequencing: ChIP-seq, RNA-seq, models and analysis; (6) flow cytometry: FACS, normalization and clustering; (7) LC-MS proteomics: peak detection, quantification, and downstream analysis; (8) combining technologies to understand gene regulation; (9) genomic structural aberrations including transposons, miRNA and others.


5/13/11 Today's R Lab will be held in the Genome Cafe (E3607) from 1:30PM-3:00PM. Andrew Jaffe will be leading the lab. The lab files are posted below. You may want to download the data before the start of the lab.
5/13/11 Today's R lab (5/13) will be held at 1:30PM. The location/notes will be announced here and by class email by 10AM.
5/5/11 Tomorrow's R Lab (5/6) has been cancelled.
4/28/11 Tomorrow's R Lab will be held in the Genome Cafe (E3607) from 10:30AM-11:50AM. Andrew Jaffe will be leading the lab.
4/25/11 First drafts of final projects are due 5/9/11 by 5pm.
4/18/11 There is no homework assignment due today. The next homework assignment (HW3 - due April 25) has been posted below.
4/14/11 Copy number R lab posted below. Run the first three commands before lab tomorrow, since they take a little while to run.
4/11/11 Homework 3 will be assigned Wednesday when we discuss multiple testing.
4/6/11 I will be unavailable Wednesday (4/6/11) morning. We will move up the lectures on SNP chips and copy number variation by Rob Scharpf to tomorrow.
4/1/11 The undergrad guide to R and lab 1 are now linked from the website, the time/location of the lab is also updated below.
3/31/11 The R Lab for today has been moved to W3031.
3/28/11 One-paragraph project proposals are due by 5pm Friday, April 8th. Please email your proposal to jleek "at" with subject [project proposal]. The proposal should include a brief description of the data set and the specific analysis questions. If you need help coming up with an idea, contact the instructor for project ideas.


Jeff Leek


Office: 615 N. Wolfe St., Rm. E3624
Email: jleek "at"
Office hours: By Appointment
Office Phone: 410-955-1166

TA: Simina Boca
TA's Email: sboca "at"

Class Times

Monday 10:30 - 11:50am
Wednesday 10:30 - 11:50am

Lab Time

Friday 1:30PM - 2:50PM


Class: Wolfe W2033
Lab: Wolfe W3031


[ PDF ]

Final Project

Due Date: May 20th, 5pm
Page limit: <= 6 pages; <= 3,000 words; <= 5 display items (including figures and tables)

Lecture Slides and Readings

Mar 28 (Mon): Course overview, molecular biology, bioinformatics, role of statistics (Leek)
1. Spielman et al. (2007) "Common genetic variants account for differences in gene expression among ethnic groups" Nature Genetics Spielman
2. Akey et al. (2007) "On the design and analysis of gene expression studies in human populations" Nature Genetics Akey Letter
3. Spielman et al. (2007) "Reply to: On the design and analysis of gene expression studies in human populations" Nature Genetics Spielman Response

HW 1 - Due April 4th 10:30AM HW1

Mar 30 (Wed): Microarray analysis, normalization, preprocessing, probe effect, background correction (Leek)
Slides: Preprocessing
1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249-64.
2. Dabney AR and Storey JD. (2007) A new approach to intensity-dependent normalization of two-channel microarrays. Biostatistics, 8:128-138.

Apr 1 (Fri): R Lab 1 (Boca)
The Undergraduate Guide to R

Apr 4 (Mon): Differential gene expression, gene set analysis, gene ontology (Leek)
Slides: Differential Expression
1. Eisen MB, Spellman PT, Brown PO, and Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. PNAS, 95: 14863-14868. Eisen
2. Tusher VG, Tibshirani R, and Chu G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS, 98: 5116-5121. Tusher
4. Subramanian A et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102: 15545-15550.

HW 2 - Due April 11th 10:30AM HW2

Apr 6 (Wed): SNP arrays, genotyping, copy number variations (Scharpf)
Slides: SNP/GT/CNV (pdf)
Readings: See Slides

Apr 8 (Fri): R Lab 2 (Boca)

Apr 11 (Mon): Variation and human disease (Scharpf)
Slides: Association (pdf)
Readings: See Slides

Apr 13 (Wed): Multiple testing, false discovery rate, and artifacts (Leek)
Slides: Multiple Testing (pdf)
1. Storey JD and Tibshirani R (2003) Statistical significance for genomewide studies. PNAS 100:9440-5.
2. Leek JT and Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3: e161.

Apr 15 (Fri): R Lab 3 (Boca)

Apr 18 (Mon): Tiling arrays, ChIP-chip, background correction (Leek)
Slides: ChIP-chip (pdf)
Updated slides: ChIP-chip (ppt)
Readings: See Slides

HW 3 - Due April 25th 10:30AM HW3

Apr 20 (Wed): ChIP-seq, motif finding (Leek)
Slides: ChIP-seq (pdf)
Readings: See Slides

Apr 22 (Fri): R Lab 4 (Boca)

Apr 25 (Mon): Flow cytometry (Leek)
Slides: Flow Cytometry (pdf)
1. Lo K, Brinkman R, and Gottardo R. (2008) "Automated gating of flow cytometry data via robust model based clustering." Cytometry A 73:321-332. Link
2. Hahne F et al. (2009) "Per-channel basis normalization methods for flow cytometry data" Cytometry A 77:121–131. Link

HW 4 - Due May 2 10:30AM HW4

Apr 27 (Wed): Sequencing - alignment (Langmead)
Slides: Sequencing (pdf)
Readings: See Slides

Apr 29 (Fri): R Lab 5 - Flow cytometry (Jaffe)
Note that the time and room have changed for this R Lab: it will be held at 10:30AM in the Genome Cafe (E3607).

Apr 27 (Wed): RNA-sequencing (Hansen)
Slides: RNA-Sequencing (pdf)
Readings: Oshlack A, Robinson MD, and Young MD (2010) "From RNA-seq reads to differential expression results" Genome Biology 11:220 Link
+ References in slides

HW 5 - Due May 2 10:30AM HW5

May 4 (Wed): Epigenomics (Aryee)
Slides: Epigenomics (pdf)
Readings: See Slides

May 9 (Mon): Basic Proteomics (Leek)
Slides: Mass Spec (pdf)
1. Karpievitch YV, Polpitiya AD, Anderson GA, Smith RD, Dabney AR. (2010) Mass spectrometry-based proteomics: Biological and technological aspects. Annals of Applied Statistics, in press
2. Karpievitch YV, Stanley J, Taverner T, Huang J, Adkins JN, Ansong C, Heffron F, Metz TO, Qian W-J, Yoon H, Smith RD, Dabney AR. (2009) A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics, 25:2028-2034
3. References within slides

May 11 (Wed): Genetics of Gene Expression/Causal Genomics (Leek)
Slides: GGE (pdf)
Readings: See Slides

May 13 (Fri): R Lab 6 (Jaffe)

May 16 (Mon): Ancient elements in genomes (Wheelan)
Slides: Ancient elements in genomes (pdf)
Readings: See Slides

May 18 (Wed): Prediction and reproducibility (Leek)
Slides: Prediction (pdf)
Readings: See Slides