Welcome
Background
"Multiple algorithms have been developed for the purpose of calling single nucleotide polymorphisms (SNPs) from Affymetrix microarrays." (Shin Lin, Benilton Carvalho, David J Cutler, Dan E Arking, Aravinda Chakravarti and Rafael A Irizarry: Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biology 2008, 9:R63). Of all these algorithms, how can we decide which one to use? I have compared two of the leading open-source algorithms currently available to find out which one is more accurate and can be used across a wide variety of Affymetrix platforms. First, I would like to address some FAQs about SNP genotyping:
What is a Single Nucleotide Polymorphism (SNP)?
"Single nucleotide polymorphisms or SNPs (pronounced "snips") are DNA sequence variations that occur when a single nucleotide (A,T,C,or G) in the genome sequence is altered. For example a SNP might change the DNA sequence AAGGCTAA to ATGGCTAA. For a variation to be considered a SNP, it must occur in at least 1% of the population. SNPs, which make up about 90% of all human genetic variation, occur every 100 to 300 bases along the 3-billion-base human genome. Two of every three SNPs involve the replacement of cytosine (C) with thymine (T). SNPs can occur in both coding (gene) and noncoding regions of the genome. Many SNPs have no effect on cell function, but scientists believe others could predispose people to disease or influence their response to a drug." (http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.shtml)
What is genotyping?
"Genotyping refers to the process of determining the genotype of an individual by the use
of biological assays." (Wikipedia) One method of determining genotypes uses microarrays. The genomic DNA
of an individual (usually labeled with a fluorochrome) is hybridized to a microarray chip that carries probes for
thousands of SNPs. The resulting hybridization intensities let us determine the genotype of that individual.
Why do all this?
"Single Nucleotide polymorphisms (SNP) microarrays represent a key technology allowing for
the high throughput genotyping necessary to assess genome-wide variation and conduct association studies." (Shin Lin,
Benilton Carvalho, David J Cutler, Dan E Arking, Aravinda Chakravarti and Rafael A Irizarry: Validation and extension
of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biology 2008, 9:R63) A genome-wide
association study uses markers (which could be SNPs) present in the genomes of individuals to identify variations associated
with a disease. Once these genetic variations have been identified, scientists can use them to develop ways to treat the disease.
(http://www.genome.gov/20019523)
What do the SNP calling algorithms do?
The SNP calling algorithms assign a genotype call to each SNP for each individual.
For a diploid individual, the call can be AA, AB or BB. AA and BB are homozygous calls, while AB means
that the individual is heterozygous for that SNP. These calling algorithms also give a confidence score for each call they
make. The purpose of this score is to help decide which SNPs to drop from further analysis. (Shin Lin, Benilton Carvalho,
David J Cutler, Dan E Arking, Aravinda Chakravarti and Rafael A Irizarry: Validation and extension of an empirical Bayes
method for SNP calling on Affymetrix microarrays. Genome Biology 2008, 9:R63)
How can we figure out which algorithm works better?
Once the calls have been made, the next step is to determine whether they
can be trusted. In other words, we need to decide whether the calls made by the algorithm are correct. Although no
algorithm to date can be trusted completely, we still need to figure out which of the currently available ones
works best. To do this, I have done several types of analyses:
Accuracy vs Drop rate plots (ADP)
Accuracy vs Confidence measure plots
Cluster plots (log2 A vs log2 B)
Accuracy vs Shift
Accuracy vs Signal to Noise Ratio (SNR) plots
For ADP plots, "each point in the graph represents the proportion of calls above a given threshold in agreement with the HapMap project." (Shin Lin, Benilton Carvalho, David J Cutler, Dan E Arking, Aravinda Chakravarti and Rafael A Irizarry: Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biology 2008, 9:R63). The ADP plots have been made on different samples.
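As a rough illustration of how the points of an ADP plot could be computed, here is a minimal Python sketch. It is not the code used for the plots on this site. The function name accuracy_vs_drop_rate, the toy calls and the rounding of the number of dropped calls are all assumptions made for the example; the reference list stands in for the HapMap genotypes.

```python
def accuracy_vs_drop_rate(calls, confidences, reference, drop_rates):
    """Return (drop_rate, accuracy) pairs for an accuracy vs drop rate curve.

    For each drop rate, the least confident fraction of calls is discarded
    and accuracy is the share of the remaining calls that match the reference.
    """
    n = len(calls)
    # Indices of the calls, ordered from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points = []
    for rate in drop_rates:
        n_keep = n - int(round(rate * n))  # assumption: round the dropped count
        kept = order[:n_keep]
        if not kept:
            continue  # nothing left to score at this drop rate
        agree = sum(calls[i] == reference[i] for i in kept)
        points.append((rate, agree / len(kept)))
    return points


# Hypothetical toy data: five calls, one of which disagrees with the reference.
calls = ["AA", "AB", "BB", "AB", "AA"]
reference = ["AA", "AB", "BB", "AA", "AA"]
confidences = [0.99, 0.95, 0.90, 0.40, 0.85]

curve = accuracy_vs_drop_rate(calls, confidences, reference, [0.0, 0.2, 0.4])
for rate, accuracy in curve:
    print(f"drop rate {rate:.1f}: accuracy {accuracy:.2f}")
```

In this toy case the only wrong call also has the lowest confidence, so dropping the least confident 20% of calls is enough to remove it.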
For Accuracy vs Confidence measure plots, the drop rate has been replaced by confidence metric thresholds. These plots show how the accuracy of each algorithm varies across datasets.
The cluster plots plot log2 A against log2 B and can show that different SNPs can have different distributions. (Carvalho et al.: Exploration, Normalization and Genotype calls of high density oligonucleotide SNP array data. Biostatistics 2007 Apr;8(2):485-99). These plots help compare the genotype calls made by two different algorithms.
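The same idea can be sketched with confidence thresholds instead of drop rates. The Python example below is only an illustration under stated assumptions: the function name accuracy_vs_confidence, the two dataset names and all of the values are hypothetical, and a call is kept when its confidence is at least the threshold.

```python
def accuracy_vs_confidence(calls, confidences, reference, thresholds):
    """Return (threshold, accuracy) pairs.

    Accuracy at a threshold is the share of calls with confidence >= threshold
    that match the reference; it is None when no call passes the threshold.
    """
    points = []
    for t in thresholds:
        kept = [i for i, c in enumerate(confidences) if c >= t]
        if not kept:
            points.append((t, None))
            continue
        agree = sum(calls[i] == reference[i] for i in kept)
        points.append((t, agree / len(kept)))
    return points


# Two hypothetical datasets, each given as (calls, confidences, reference).
datasets = {
    "dataset_a": (["AA", "AB", "BB", "AB"], [0.9, 0.8, 0.7, 0.3],
                  ["AA", "AB", "BB", "BB"]),
    "dataset_b": (["AA", "AA", "BB", "AB"], [0.6, 0.9, 0.8, 0.95],
                  ["AB", "AA", "BB", "AB"]),
}

thresholds = [0.0, 0.5, 0.7]
curves = {
    name: accuracy_vs_confidence(c, conf, ref, thresholds)
    for name, (c, conf, ref) in datasets.items()
}
for name, curve in curves.items():
    print(name, curve)
```

Plotting one such curve per dataset on the same axes is one way to see whether a given confidence threshold corresponds to the same accuracy everywhere.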
For Accuracy vs Shift plots, accuracies were plotted against shifts (for each genotype) coming from the spline correction step in the CRLMM algorithm. For more information about "shift", see Carvalho et al.
The Accuracy vs SNR plots show, for each array, the average accuracy against the computed Signal to Noise Ratio (SNR) of that array. For more information about how SNR is computed, see Carvalho et al.
In addition, I have also plotted M vs S plots to show how the quality of arrays can affect genotyping. In these plots, M = log(A/B) and S = (log A + log B)/2.
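The two quantities defined above can be computed directly from the A and B allele intensities. The Python sketch below assumes base-2 logarithms, as used for the cluster plots; the function name m_and_s and the intensity values are hypothetical.

```python
import math


def m_and_s(a, b):
    """Return (M, S) for allele intensities a and b.

    M = log(A/B) is the log ratio and S = (log A + log B)/2 is the average
    log intensity. Base-2 logarithms are an assumption of this sketch.
    """
    m = math.log2(a / b)
    s = (math.log2(a) + math.log2(b)) / 2
    return m, s


# Hypothetical intensities for three SNPs on one array.
intensities = [(1024.0, 256.0), (512.0, 512.0), (256.0, 1024.0)]
for a, b in intensities:
    m, s = m_and_s(a, b)
    print(f"A={a:.0f} B={b:.0f} -> M={m:.1f}, S={s:.1f}")
```

With these toy values, a strongly positive M points towards AA, M near zero towards AB and a strongly negative M towards BB, while S stays the same because the overall intensity is unchanged.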