
Applications Note

ldSNP: Select optimal subsets of SNPs with maximum statistical power for the analysis of incomplete candidate gene association studies.


Most genetic association studies analyse only a small subset of the SNPs in a gene or region. The ldSNP application employs a novel deterministic algorithm to search the combinatorial space of all possible subsets for minimal subsets of SNPs which maximize statistical power for the SNPs not genotyped. In genes where validation by brute force searching was practicable, the algorithm correctly identified all optimal subsets.


Web accessible interactive application (free registration required).

Contact: [email protected]


Genotyping laboratory costs prohibit the testing of all available single nucleotide polymorphisms (SNPs) in a candidate gene or region in screening association studies, so researchers are usually constrained to select a subset of SNPs.

When the genotyped subset of SNPs is analysed, any association signal from a SNP not genotyped will be detectable only because of linkage disequilibrium (LD) with a genotyped SNP, so LD is a natural measure of utility for selecting small sets of SNPs which provide as much information as possible about SNPs not in the sets. A method and an algorithm are described for selecting subsets of SNPs based on the amount of LD they capture from SNPs outside the set. The search for optimal subsets takes place in the space of all possible subsets, of size 2^G where G is the number of SNPs in the gene or region.

The application described here identifies optimal subsets of SNPs of varying size, allowing a subset of appropriate size and performance to be selected. The current implementation can process very large genes reasonably quickly on commodity hardware: for example, ABO, which has 164 SNPs, requires about 30 seconds to process on a desktop PC. This application allows the designer of an incomplete association study to explicitly balance genotyping cost against mean loss of effective sample size for SNPs not genotyped.


While many measures of the strength of LD are available, pairwise r^2 is used here because it can be interpreted as the effective loss of sample size when the effect of a causal SNP which was not genotyped is detected by testing a SNP which was genotyped. Note that although this application is currently limited to pairwise LD, the principle remains the same for three way or higher dimensional LD measures.
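As a minimal illustration of this interpretation, the sketch below assumes the usual approximation that testing a proxy SNP in n subjects gives roughly the power of testing the causal SNP directly in n × r^2 subjects. The function names are hypothetical and are not part of ldSNP.

```python
def effective_sample_size(n_subjects, r2):
    """Approximate sample size retained when a genotyped proxy SNP stands in
    for an ungenotyped causal SNP with pairwise r^2 equal to r2."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return n_subjects * r2


def required_sample_size(n_direct, r2):
    """Subjects needed when testing the proxy to match the power of
    n_direct subjects genotyped at the causal SNP itself."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_direct / r2
```

Under this approximation, at r^2 = 0.5 a study of 1000 subjects behaves like a study of about 500 subjects for the SNP that was not genotyped.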

Consider a gene or region containing a set {G} of SNPs, of which only S can be selected for genotyping. Let LD_tj be the pairwise r^2 value between two SNPs t ∈ {G} and j ∈ {G}, t ≠ j. Conceptually, the optimization problem can be expressed as ranking all possible subsets of size S in order of the mean LD they capture with the SNPs not genotyped.

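Let one such subset be {S}. Let {T} be the set of SNPs in {G} but not in {S} (i.e., {T} = {G}\{S}). For each SNP t ∈ {T}, define BestLD_t as the largest pairwise r^2 value between t and any SNP j ∈ {S}:

\[ \mathrm{BestLD}_t = \max_{j \in \{S\}} \mathrm{LD}_{tj} \]

For subset {S}, define the minimum of these as:

\[ \mathrm{MinBestLD}_S = \min_{t \in \{T\}} \mathrm{BestLD}_t \]

and define their mean as:

\[ \mathrm{MeanBestLD}_S = \frac{1}{n_t} \sum_{t \in \{T\}} \mathrm{BestLD}_t \]

where n_t is the number of SNPs in {T}.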

BestLD_t is the worst case effective sample size deflation for any SNP in {G} but not in {S} if the SNPs in {S} are genotyped and analysed. Note that for every SNP in {S}, BestLD is defined as 1.0 since these SNPs will be genotyped. MinBestLD_S is a criterion to rank the utility of any subset {S} ⊂ {G}, in order to identify the subset which will give the greatest statistical power to detect the effect of a causal SNP not in {S}, assuming that each SNP in {S} is analysed one at a time. MeanBestLD_S is also a useful criterion for ranking {S}, since this is the average loss of effective sample size over all SNPs in {G} when {S} is analysed one SNP at a time. It is not, however, a good criterion to optimize on its own in practice, because a small {S} with a high MeanBestLD_S may still have a low MinBestLD_S.
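The three quantities defined above can be computed directly from a pairwise r^2 matrix. The following is a minimal sketch, not the ldSNP implementation: the matrix R2 is invented toy data, SNPs are identified by their row index, and all function names are hypothetical.

```python
def best_ld(r2, selected):
    """Map each SNP t outside `selected` to its largest r^2 with any selected SNP."""
    chosen = set(selected)
    if not chosen:
        raise ValueError("at least one SNP must be selected")
    return {
        t: max(r2[t][j] for j in chosen)
        for t in range(len(r2))
        if t not in chosen
    }


def min_best_ld(r2, selected):
    """Smallest BestLD over the SNPs not selected (1.0 if every SNP is selected)."""
    values = list(best_ld(r2, selected).values())
    return min(values) if values else 1.0


def mean_best_ld(r2, selected):
    """Mean BestLD over the SNPs not selected (1.0 if every SNP is selected)."""
    values = list(best_ld(r2, selected).values())
    return sum(values) / len(values) if values else 1.0


# Invented 4-SNP example: SNPs 0 and 1 are in strong LD, as are SNPs 2 and 3.
R2 = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
```

With this toy matrix, selecting SNPs 0 and 2 leaves SNPs 1 and 3 ungenotyped, with BestLD values of 0.9 and 0.8, so MinBestLD is 0.8 and MeanBestLD is about 0.85. Selecting SNPs 0 and 1 instead gives a MinBestLD of only 0.2, because neither selected SNP is a good proxy for SNP 3.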



The implementation is in Python, using the Psyco JIT Python compiler for speed, with a Zope web application front end. It has been tested by comparing its results with a brute force search for 20 genes ranging up to 20 SNPs in size, and it correctly identified the optimal ldSNP sets in every case. The number of all possible subsets is 2^G, so testing against brute force searching becomes increasingly impracticable as the gene size G increases.
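A brute force search of the kind used for validation can be sketched as follows. This is an illustration only, written in present-day Python rather than the original Psyco environment. It enumerates every subset of a given size and ranks by the minimum BestLD criterion. The matrix R2 is invented toy data and the function names are hypothetical. It is not the deterministic ldSNP algorithm, which is not reproduced here.

```python
from itertools import combinations


def min_best_ld(r2, subset):
    """Smallest, over SNPs outside `subset`, of the largest r^2 with any SNP in it."""
    chosen = set(subset)
    outside = [t for t in range(len(r2)) if t not in chosen]
    if not outside:
        return 1.0  # every SNP is genotyped
    return min(max(r2[t][j] for j in chosen) for t in outside)


def brute_force_best_subsets(r2, size):
    """Return the best MinBestLD over all subsets of `size`, and every subset achieving it."""
    best_score, best_sets = -1.0, []
    for subset in combinations(range(len(r2)), size):
        score = min_best_ld(r2, subset)
        if score > best_score:
            best_score, best_sets = score, [subset]
        elif score == best_score:
            best_sets.append(subset)
    return best_score, best_sets


# Invented 4-SNP example: SNPs 0 and 1 are in strong LD, as are SNPs 2 and 3.
R2 = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
```

For the toy matrix and subsets of size 2, the search finds four equally good subsets, each pairing one SNP from {0, 1} with one from {2, 3}. The cost of enumeration is what limits this approach: a gene with G = 20 SNPs already has 2^20 = 1,048,576 subsets.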

The application takes as input raw genotype data or LD matrices from the LDMAX utility distributed with the Gold LD viewer. A user specified filter cutoff for SNP rare allele frequency may be provided, since in most real situations statistical power will be limited for rare SNPs.
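A rare allele frequency filter of this kind might look like the sketch below. The genotype coding (0, 1 or 2 copies of one allele per subject, None for a missing call), the SNP names and the function names are all assumptions made for illustration and do not describe the input formats ldSNP accepts.

```python
def rare_allele_frequency(genotypes):
    """Frequency of the rarer allele for one SNP.

    `genotypes` holds one entry per subject: the count (0, 1 or 2) of one
    allele, or None where the genotype call is missing.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2.0 * len(observed))
    return min(freq, 1.0 - freq)


def filter_snps(genotypes_by_snp, cutoff):
    """Keep only SNPs whose rare allele frequency exceeds `cutoff`."""
    return {
        name: calls
        for name, calls in genotypes_by_snp.items()
        if rare_allele_frequency(calls) > cutoff
    }


# Invented example data for three hypothetical SNPs.
EXAMPLE = {
    "snpA": [0, 1, 2, 1, None],                 # rare allele frequency 0.5
    "snpB": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],     # rare allele frequency 0.05
    "snpC": [2, 2, 2, 2, 2],                    # monomorphic in this sample
}
```

With a cutoff of 0.10, only snpA survives in this invented example.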



ldSNP sets are intended to maximise effective sample size for SNPs not genotyped in incomplete genetic association studies where analysis will be based on the SNPs which have been genotyped. The practical algorithm described here for searching the combinatorial space of all possible subsets appears to perform well when validated against brute force searching for modest sized genes.

Two other methods for explicitly capturing LD in subsets of SNPs have been recently described: one based on decomposition of the composite LD matrix using a rotated principal components approach, and another based on maximizing Shannon entropy. Haplotype tag SNP sets (htSNP) which capture all haplotype diversity also capture LD in a sense, although through full haplotypes rather than as pairwise LD. The ldSNP application has the advantage of being unaffected by the complexities of statistical haplotype inference and haplotype block substructure.

This application was originally inspired by a report available at the SeattleSNPs data repository which uses a clustering approach to bin SNPs with similar r^2 for one r^2 threshold (0.64). That method does not appear to have been formally published, but the concept of using a threshold to identify subsets led to the application reported here, which takes user supplied data in a number of formats and provides a range of choices for subset size and LD to suit the experimental design.

The approach described here is based entirely on the pairwise LD matrix and does not take putative function into account. In practice, the wise researcher would ensure that ldSNP sets were augmented with the most likely potentially functional variants such as non-synonymous coding SNPs sufficiently common to give adequate statistical power.


Individual subject ABO genotype data were accessed from the SeattleSNPs PGA website.

The use of a threshold LD value to find small sets of SNPs to capture LD was first suggested by Christopher Carlson as implemented in the LDClusters (TagSNPs) reports available for each gene at the above web site.



Supported by Programs for Genomic Applications, Grant U01 HL66795: Innate Immunity in Heart, Lung and Blood Disease, from the National Heart, Lung and Blood Institute.


Table 1. Optimal ldSNP sets of varying size found by sweeping MinBestLD from 0.5 to 0.9 in steps of 0.1 for ABO, >10% rare allele frequency SNPs (N=149) in African American subjects. Raw data from the SeattleSNPs website, accessed October 28, 2003.

164, 322, 450, 974, 1244, 2584, 2647, 3551, 10735, 16090, 18007

164, 322, 450, 690, 1035, 1244, 1852, 2566, 2584, 3340, 4404, 16090, 18007

322, 450, 497, 690, 1035, 1244, 1435, 1852, 2566, 2584, 3328, 3340, 4404, 16090

322, 450, 497, 690, 1035, 1244, 1852, 2566, 2584, 3220, 3338, 3340, 4404, 8080, 10863, 15913, 16090, 18007, 18059
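The threshold sweep named in the Table 1 caption can be illustrated on a toy scale. The sketch below assumes that, for each MinBestLD threshold, the goal is the smallest subset whose MinBestLD meets the threshold, and it finds that subset by exhaustive search. It is not the ldSNP algorithm, the matrix R2 is invented, and the function names are hypothetical.

```python
from itertools import combinations


def min_best_ld(r2, subset):
    """Smallest, over SNPs outside `subset`, of the largest r^2 with any SNP in it."""
    chosen = set(subset)
    outside = [t for t in range(len(r2)) if t not in chosen]
    if not outside:
        return 1.0  # every SNP is genotyped
    return min(max(r2[t][j] for j in chosen) for t in outside)


def smallest_subset_meeting(r2, threshold):
    """First subset found, in increasing size, with MinBestLD >= threshold."""
    n = len(r2)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if min_best_ld(r2, subset) >= threshold:
                return subset
    return tuple(range(n))  # only reached if threshold exceeds 1.0


def sweep(r2, thresholds):
    """Map each threshold to the smallest qualifying subset."""
    return {thr: smallest_subset_meeting(r2, thr) for thr in thresholds}


# Invented 4-SNP example: SNPs 0 and 1 are in strong LD, as are SNPs 2 and 3.
R2 = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

RESULT = sweep(R2, [0.5, 0.6, 0.7, 0.8, 0.9])
```

In this toy example a two SNP subset is enough for thresholds from 0.5 to 0.8, while a threshold of 0.9 requires three SNPs, which mirrors the way the sets in Table 1 grow as the threshold rises.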