Advanced techniques in biostatistics and bioinformatics are often required in the investigation of complex human diseases because analysis must not only address the effect of a large number of etiological determinants (as in conventional epidemiology) but must also take proper account of complex biological models. Variation discovery programs are expensive, and expert statistical input to study design and analysis is crucial to ensure that limited resources are used efficiently. The widespread availability of SNP data has generated many new issues in statistical analysis [Palmer and Cookson, 2001].

Our goal is twofold. First, to perform analyses with the sequencing and SNP genotype data that will provide useful complementary information to other investigators regarding the SNPs identified in our candidate genes. Secondly, to develop new tools and resources relevant to SNP genotyping through a program of methodological development in bioinformatics and biostatistics that will dynamically interact with the applied research component.

1. Analysis of DNA sequence data

Following definition of SNPs within each sequence sample (see Sequencing Notes), Hardy-Weinberg equilibrium was tested at each SNP locus on a contingency table of observed versus predicted genotype frequencies using a modified Markov-chain random walk algorithm [Guo and Thompson, 1992]. Pairwise linkage disequilibrium between each pair of SNP loci was analyzed using a likelihood-ratio test, whose empirical distribution was obtained by a permutation procedure [Slatkin and Excoffier, 1996].
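As an illustration of the kind of simulation-based test described above, the following is a minimal sketch of a Monte Carlo test of Hardy-Weinberg proportions at a single diallelic SNP. It is a simplified stand-in, not the Guo and Thompson Markov-chain algorithm or the Arlequin implementation: alleles are shuffled and re-paired at random, and a chi-square distance is used as the test statistic. The function names, the statistic and the default number of permutations are assumptions made for this example.

```python
import random


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square distance between observed and HWE-expected genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic loci).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hwe_permutation_p(n_aa, n_ab, n_bb, n_perm=2000, seed=0):
    """Empirical P-value for departure from HWE, conditional on allele counts."""
    rng = random.Random(seed)
    alleles = ["A"] * (2 * n_aa + n_ab) + ["B"] * (2 * n_bb + n_ab)
    observed = hwe_chi2(n_aa, n_ab, n_bb)
    hits = 0
    for _ in range(n_perm):
        # Random mating: shuffle alleles and pair neighbours into genotypes.
        rng.shuffle(alleles)
        pairs = list(zip(alleles[0::2], alleles[1::2]))
        aa = sum(1 for x, y in pairs if x == y == "A")
        bb = sum(1 for x, y in pairs if x == y == "B")
        ab = len(pairs) - aa - bb
        if hwe_chi2(aa, ab, bb) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)
```

A locus with no heterozygotes at intermediate allele frequency, e.g. `hwe_permutation_p(50, 0, 50)`, gives a very small empirical P-value, whereas counts in exact Hardy-Weinberg proportions, e.g. `hwe_permutation_p(25, 50, 25)`, give a value of 1.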

Maximum likelihood haplotype frequencies were imputed within each group of subjects sequenced (European-American, Hispanic-American, African-American) using an Expectation-Maximization (EM) approach [Excoffier and Slatkin, 1995], as implemented in the program Arlequin v2.0. The likelihood of the observed data D given the haplotype frequencies p under this approach is

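$$L(D \mid p) = \prod_{i=1}^{n} \sum_{j=1}^{g_i} G_{ij}$$

where the product is over all n individuals in the sample, the sum is over the $g_i$ possible genotypes (haplotype pairs) compatible with the observed data of individual i, and $G_{ij} = 2 p_k p_l$ if genotype j is made up of two different haplotypes k and l, or $G_{ij} = p_k^2$ if it carries two copies of haplotype k. The EM algorithm was repeated from 20 different starting points. Standard deviations were estimated using a parametric bootstrap procedure.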
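The EM idea can be sketched for the simplest case of two diallelic SNPs, where only double heterozygotes have ambiguous phase. This is an illustrative sketch, not the Arlequin implementation: it handles two loci only, starts from a single uniform starting point rather than 20 different ones, and omits the bootstrap. The genotype coding (copies of allele "1" at each SNP) and the function name are assumptions made for this example.

```python
def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate two-SNP haplotype frequencies by EM.

    genotypes: list of (g1, g2), the number of copies (0, 1 or 2)
    of allele '1' carried at SNP 1 and SNP 2 by each individual.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}  # uniform starting point
    n_double_het = sum(1 for g in genotypes if g == (1, 1))

    # Individuals with at most one heterozygous site have known phase,
    # so their haplotype counts are fixed across iterations.
    base = {h: 0.0 for h in haps}
    for g1, g2 in genotypes:
        if (g1, g2) == (1, 1):
            continue
        first = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
        second = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
        base[first] += 1
        base[second] += 1

    total = 2 * len(genotypes)
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 00/11
        # rather than 01/10, given the current frequencies.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = dict(base)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        new = {h: counts[h] / total for h in haps}

        delta = max(abs(new[h] - freqs[h]) for h in haps)
        freqs = new
        if delta < tol:
            break
    return freqs
```

For example, `em_haplotype_freqs([(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2)` assigns essentially all of the ambiguous double heterozygotes to the 00/11 phase, because those are the only haplotypes seen in the unambiguous individuals. In practice the likelihood can have several local maxima, which is why restarting from multiple starting points is advisable.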

Oracle v8i, SAS and Arlequin v2.0 were used to manage and analyze the data. P-values were derived by empirical simulation where possible. Statistical significance was defined at the standard 5% level.

2. Analysis of SNP genotype data: General considerations

Specific analytic considerations relevant to the analysis of SNP genotypes in each gene are considered in specific statistical analysis notes on the page for each gene.

2.1 Introduction: Statistical analysis of marker associations

Historically, association analysis of genetic polymorphisms has been most often performed in a case-control setting, with unrelated case subjects compared to unrelated control subjects. Significant differences in allele frequencies between cases and controls are taken as evidence for involvement of an allele in disease susceptibility. Case-control analysis is thus a well-validated technique for the discovery of alleles associated with human disease susceptibility [Khoury et al., 1993].

One of the limitations of linkage analysis is the difficulty of fine mapping the location of a gene influencing a complex disorder. There are not usually enough meioses within 1-2 megabases of the disease gene to detect recombination events; moreover, with the effects of phenocopies and genetic heterogeneity in complex diseases, critical recombination events cannot be identified with certainty. Rather than using family data, we will use population data to refine the position of a disease gene through the use of linkage disequilibrium and allelic association. An advantage of using population data is that patterns of linkage disequilibrium are the result of recombinations that have occurred in past generations and therefore effectively increase the recombinant sample. There is also some evidence that marker-disease association studies are generally more powerful than transmission disequilibrium-based tests [Long and Langley, 1999]. Linkage disequilibrium mapping has been applied to many simple monogenic (Mendelian) human traits with success [Jorde, 1995], although many properties of the technique have not been studied extensively. Applications and extensions of such mapping approaches to more complex human traits are in an early stage of development [Olson et al., 1999; Risch, 2000]. This is an important area of methodological research, as the use of linkage disequilibrium mapping in appropriate populations may represent a means of dealing with some of the complexities and difficulties associated with standard mapping approaches to complex genetic traits [Lander and Schork, 1994; Terwilliger and Goring, 2000]. Research aimed at developing strategies for mapping susceptibility loci for complex traits is thus critical to the success of current gene discovery projects in heart, lung and blood disease.

This section describes both the applied statistical analyses we are undertaking on the data produced by the PhAT and the ongoing methodological research in bioinformatics and biostatistics in which we are engaged. We anticipate that these two components will interact synergistically throughout the life of the PhAT.

2.2 Descriptive analyses

Standard descriptive statistics for case and control populations, including allele frequencies in each ethnic group, are generated using the database system and a number of statistical packages and are available on the web site. Hardy-Weinberg proportions and pairwise linkage disequilibrium between SNPs are tested as for the sequence data (see above). Calculation of Hardy-Weinberg Equilibrium (HWE) also serves as a crude quality check on the data; experience suggests that gross deviations from HWE often indicate genotyping errors or population admixture.

2.3 Case-control SNP-disease association analyses

We will use a variety of techniques to investigate SNP-disease associations. These will include both single SNP associations and extended analyses of haplotypes and clusters of multiple SNPs.

2.3.1 Single SNP-disease association analyses

Our primary case-control analyses are based upon the use of contingency tables and unconditional logistic regression [Hosmer and Lemeshow, 1989] to estimate the age- and sex-adjusted relative risk (approximated by odds ratios [OR]) attributable to each SNP within each case-control ethnic group (n=70 cases, 140 controls). Once the study population has been genotyped at the candidate loci, observed genotypes, age and sex will enter the analysis as covariates associated with fixed effects (regression coefficients) that will reflect the magnitude of the effect of the polymorphism on the phenotype being studied (e.g. asthma status).
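The contingency-table part of this analysis can be sketched as follows. This is a minimal example of an unadjusted odds ratio with a Woolf (log-based) confidence interval from a 2x2 table of carrier status by case status; it does not perform the age and sex adjustment that the logistic regression provides. The function name, the 0.5 continuity correction for empty cells and the example counts are assumptions made for illustration, not data from the study.

```python
import math


def odds_ratio_ci(case_exp, case_unexp, ctrl_exp, ctrl_unexp, z=1.96):
    """Unadjusted odds ratio and Woolf confidence interval from a 2x2 table.

    z=1.96 gives an approximate 95% interval.
    """
    cells = [case_exp, case_unexp, ctrl_exp, ctrl_unexp]
    # Add 0.5 to every cell if any cell is empty, to keep the estimate finite.
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 35 of 70 cases and 47 of 140 controls carry the allele.
or_hat, lo, hi = odds_ratio_ci(35, 35, 47, 93)
```

With covariates such as age and sex, the same quantity would instead be obtained by exponentiating the genotype coefficient of a fitted logistic regression model.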

2.3.2 Haplotype analysis

Haplotype analyses will be undertaken as for the analysis of sequence data (see above).

2.4 Empirical P-values

It is important to restrict the type 1 error in genetic studies and to account for multiple testing issues [Lander and Kruglyak, 1995]. Global levels of significance for SNP-disease associations within ethnic and case-control groups (conditional on the entirety of the actual data, including SNP genotypes) are determined empirically using permutation of the SNP genotypes and Monte-Carlo simulation via a Metropolis algorithm.

In addition to asymptotic P-values, all P-values derived for the specific analyses of association/linkage disequilibrium and haplotype/phylogeny of the data produced by the PhAT are also calculated empirically using permutation and Monte-Carlo simulation in order to give some impression of the plausibility of individual results.
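A global empirical P-value of the kind described in this section can be sketched with a simple permutation scheme. The example below is an assumption-laden illustration rather than the procedure actually used: it permutes case-control labels (which breaks the link between status and the whole set of SNP genotypes at once), uses the largest absolute difference in carrier proportion across SNPs as the test statistic, and does not involve a Metropolis algorithm. All function names are hypothetical.

```python
import random


def carrier_diff(genos, status):
    """Absolute difference in carrier proportion between cases and controls."""
    case = [g > 0 for g, s in zip(genos, status) if s == 1]
    ctrl = [g > 0 for g, s in zip(genos, status) if s == 0]
    return abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))


def global_permutation_p(snp_matrix, status, n_perm=1000, seed=0):
    """Empirical P-value for the strongest single-SNP association.

    snp_matrix: list of per-SNP genotype lists (copies of the allele, 0/1/2).
    status: list with 1 for cases and 0 for controls, in the same subject order.
    """
    rng = random.Random(seed)
    observed = max(carrier_diff(g, status) for g in snp_matrix)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        # Shuffling labels keeps the correlation between SNPs intact,
        # so the null distribution reflects linkage disequilibrium.
        rng.shuffle(labels)
        best = max(carrier_diff(g, labels) for g in snp_matrix)
        if best >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the maximum is taken over all SNPs within each permutation, the resulting P-value is already adjusted for the number of SNPs tested, without assuming that the tests are independent.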

2.5 Statistical power

Our primary scientific aim is to test for association with SNPs in 50 selected genes that are already strong candidates for having a functional role in determining non-cognate immunity-mediated disease risk. However, our genotyping component is not designed to definitively discover genes having functional effects on disease risk for the four diseases studied, but rather to narrow the number of potentially important SNPs from the approximately 5,000 we anticipate discovering within the 100 candidate genes investigated. Given this goal, rather than using the conventional P<0.05 threshold for significance, we will use a more liberal ‘screening’ threshold of P<0.10. We anticipate that, given our assumptions, this will allow us to exclude the majority of the SNPs discovered from having important associations with disease risk (see the power calculations below). However, we will also calculate global empirical P-values using Monte Carlo algorithms to adjust for multiple comparisons with all SNPs in order to provide a more accurate characterization of the probability of a true association.

Power calculations assume that a polymorphism operates as a simple binary exposure, with the proportion of the population exposed determined by the allele frequency (e.g. a 10% exposure is equivalent to a dominant allele at HWE with a frequency of 5.1%, or a recessive allele at HWE with a frequency of 31.6%).
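The conversion between allele frequency and exposure under Hardy-Weinberg equilibrium is simple arithmetic, sketched below. With allele frequency q, carriers of at least one copy (dominant model) make up 1 - (1 - q)^2 of the population, and carriers of two copies (recessive model) make up q^2. The function names are chosen for this example only.

```python
import math


def exposure_dominant(q):
    """Proportion carrying at least one copy of an allele with frequency q."""
    return 1 - (1 - q) ** 2


def exposure_recessive(q):
    """Proportion carrying two copies of an allele with frequency q."""
    return q ** 2


def allele_freq_for_exposure(exposure, model):
    """Invert the HWE relationship to recover the allele frequency."""
    if model == "dominant":
        return 1 - math.sqrt(1 - exposure)
    if model == "recessive":
        return math.sqrt(exposure)
    raise ValueError("model must be 'dominant' or 'recessive'")


print(round(allele_freq_for_exposure(0.10, "dominant"), 3))   # 0.051
print(round(allele_freq_for_exposure(0.10, "recessive"), 3))  # 0.316
print(round(exposure_dominant(0.20), 2))                      # 0.36
print(round(exposure_recessive(0.20), 2))                     # 0.04
```

The last two values correspond to the exposure columns of the 20% allele-frequency row in Table 1.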

Power calculations suggest that within each case-control ethnic group for each of the five disease groups (n=70 cases and 140 controls in each group), we will have around 80% power to detect an odds ratio (OR) of around 2 for a case having at least one copy of a SNP (Table 1) at α=0.1 and for allele frequencies of a disease-associated dominant SNP of >10% and of a disease-associated recessive SNP of >30%. The cells in Table 1 illustrate the detectable difference profile over a range of exposure prevalences.

By accepting a more liberal α (type 1 error probability) of 10%, our study will have around 80% power to detect variants with an allele frequency of at least 20% that increase the relative risk of disease by a factor of at least 1.81 (for a dominant SNP) to 2.98 (for a recessive SNP) (Table 1). As the primary goal of most research into the genetic basis of the common, complex heart, lung and blood diseases we are studying is to discover common variants of moderate/large effect which may be important in the general population (i.e., explain a substantial proportion of the etiological fraction), we anticipate having adequate statistical power to accomplish our goal of narrowing the number of potentially important SNPs within the candidate genes investigated. Candidate SNPs having an OR consistent with a power to detect the difference of at least 80% will be highlighted on our web site as candidates for further investigation by other researchers.

Table 1. Detectable difference (odds ratio) for case-control analyses of 70 cases and 140 controls (power=80%, α=0.1).

| Allele frequency (a) | Exposure, dominant SNP (b) | Detectable OR, dominant (c) | Exposure, recessive SNP (b) | Detectable OR, recessive (d) |
|---|---|---|---|---|
| 10% | 19% | 1.95 | 1% | 5.47 |
| 20% | 36% | 1.81 | 4% | 2.98 |
| 30% | 51% | 1.84 | 9% | 2.31 |
| 40% | 64% | 1.96 | 16% | 2.02 |
| 50% | 75% | 2.22 | 25% | 1.88 |
| 60% | 84% | 2.82 | 36% | 1.81 |

a Allele frequency in cases.

b Exposure (=prevalence) in cases assuming a diallelic locus with a dominant or recessive allele at Hardy-Weinberg equilibrium.

c Detectable OR between cases and controls for possession of at least one copy of disease-associated SNP by case.

d Detectable OR between cases and controls for possession of two copies of disease-associated SNP by case.

These power calculations are simplified, as the true power to detect functional association and linkage disequilibrium may depend upon: the prevalence of the mutant allele; the recombination fraction between mutant allele and marker; the size of the effect of the mutant allele on phenotype; the type of study population; and the penetrances of the functional locus genotypes [Weeks and Lathrop, 1995]. Further, the power calculations are based only on single SNP-disease association analysis; multi-locus analysis will also be undertaken.

Multiple testing issues are ignored in these calculations. It is somewhat unclear how many independent tests are actually being performed; the candidate genes investigated will not be independently/randomly chosen, but rather will be chosen on the basis of potential pleiotropy across several diseases (i.e., potentially sharing a common causal pathway). In addition, it could be argued that choosing a gene also chooses a linkage disequilibrium region; this implies that only 50 independent tests are being undertaken for the single SNP-disease association analyses. More certainly, haplotype analyses within genes will likely represent only 50 independent tests. Finally, the use of empirically determined global significance criteria will at least partially mitigate issues of multiple testing.
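To illustrate why the number of independent tests matters, a short worked calculation follows. It assumes, purely for illustration, m = 50 fully independent tests, each carried out at the screening threshold α = 0.1 with no true associations present; the symbols m and FWER (the probability of at least one false positive) are introduced here and are not part of the original analysis plan.

```latex
\[
\mathrm{FWER} = 1 - (1 - \alpha)^{m} = 1 - 0.9^{50} \approx 0.995
\]
\[
\alpha_{\text{per test}} = \frac{\alpha}{m} = \frac{0.1}{50} = 0.002
\quad \text{(Bonferroni threshold for a global level of } 0.1\text{)}
\]
```

Under these assumptions at least one nominally significant result is almost certain by chance alone, and a Bonferroni-style correction would require a much stricter per-test threshold. Empirically determined global P-values avoid the need to assume independence between tests.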

The power calculations given in Table 1 are conservative, as we have a policy of replication of positive findings within independent datasets available within the Channing Laboratory. In addition, an effect corresponding to a binary exposure is the least powerful influence that a SNP could have, and at any given exposure prevalence the estimated detectable difference is a conservative minimum. Finally, haplotype approaches will be used in addition to the single SNP-disease approach used in these power calculations. Multilocus approaches are likely to be uniformly more powerful than single SNP association tests [Collins and Morton, 1998].

2.6 Inter- and intra-ethnic-group variability

Inter- and intra-ethnic-group variability will be initially examined with regard to both allele frequencies and strength of associations in single SNP analyses. Variability will also be investigated with regard to differences in linkage disequilibrium and haplotype patterns between ethnic groups.

2.7 Methodological development in biostatistics and bioinformatics

An important function of the PhAT, which interacts with all of the other components, is to conduct methodological research in genetic statistics and bioinformatics. This will provide new tools and resources that will allow researchers in the heart, lung and blood scientific community to improve the efficiency of their research into specific diseases. There are several primary areas of research that we will address over the next 4 years. The areas detailed below are not exhaustive, and it is certain that we will be led in new directions as we become involved in particular areas of collaborative research and as laboratory technology advances. We anticipate that these areas will have broad significance for the field of genetic epidemiology and bioinformatics, and will be directly applicable to the genotyping studies of discovered SNPs.

There are several issues which concern the use of linkage disequilibrium to map heart, lung and blood diseases. These include the correlation between disequilibrium and physical distance and multiple testing problems arising from the use of polymorphic markers. Jorde and colleagues [Jorde et al., 1994] investigated the relationship between linkage disequilibrium and physical distance, and demonstrated that linkage disequilibrium does not correlate well with physical distances in genomic regions less than 50 kb. Further, the extent and implications of inter- vs. intra-population variability are poorly defined. Each of these areas will be the subject of ongoing methodological research by Drs Palmer and Kohane and their collaborators using MCMC, composite likelihood approaches, phylogenetic analysis, automated classification techniques and other methods. We will use both extensive simulated data and SNP data generated by the PhAT itself and by other SNP discovery groups.

2.7.1 Study design

An important aim of the methodological research will be to develop novel study designs that will facilitate linkage disequilibrium mapping within various types of populations, and to investigate the relative power and efficiency of these designs. For instance, information generated regarding inter- and intra-population variability (heterogeneity) should allow more efficient studies of asthma, COPD, MI and DVT to be designed. The properties and power of different study designs under various heterogeneity scenarios will be investigated using PhAT data, data from the Human Genome Project and extensive simulated data.

2.7.2 Extension of bioinformatics techniques to SNP analysis

Bioinformatics techniques for examining clustering and associations within complex datasets (machine learning) currently in active development in Dr Kohane’s group at CH will be adapted for use in analyzing SNP data of the sort produced by the PhAT. These include both unsupervised and supervised learning techniques.

2.8 Statistical software

We anticipate that one of the outcomes of our methodological research program, which will be an important part of our education and dissemination mission, will be the production of software and programs for genetic and bioinformatic analysis that will be disseminated via the internet, e.g., WinBUGS files for cladistic analysis. Professor Robert Elston, a consultant to the PhAT, has a great deal of experience in the production and dissemination of software for genetic analysis (his group produces the most widely used software package for genetic analysis in the world, the S.A.G.E. program). In addition to our own production and dissemination of software, it is possible that some of our methodological innovations will be directly incorporated into S.A.G.E. Software produced by our Center will be guided by the following principles: portability to major computer architectures, comprehensive and clear documentation, and user friendliness. Software will be distributed via the PhAT website.

LJ Palmer
18 December 2001

References cited
  • Collins A, Lonjou C, Morton NE. Genetic epidemiology of single-nucleotide polymorphisms. Proc Natl Acad Sci U S A 1999;96:15173-15177.
  • Collins A, Morton NE. Mapping a disease locus by allelic association. Proc Natl Acad Sci U S A 1998;95:1741-1745.
  • Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995;12:921-927.
  • Guo SW, Thompson EA. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 1992;48:361-372.
  • Hosmer D, Lemeshow S. Applied logistic regression. Wiley Series in Probability and Mathematical Statistics. Wiley: New York, 1989.
  • Jorde L. Linkage disequilibrium as a gene-mapping tool [editorial; comment]. Am J Hum Genet 1995;56:11-14.
  • Jorde LB, Watkins WS, Carlson M, Groden J, Albertsen H, Thliveris A, Leppert M. Linkage disequilibrium predicts physical distance in the adenomatous polyposis coli region. Am J Hum Genet 1994;54:884-898.
  • Khoury M, Beaty T, Cohen B. Fundamentals of genetic epidemiology. Oxford University Press: Oxford, 1993; p 383.
  • Lander E, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nature Genet 1995;11:241-247.
  • Lander E, Schork N. Genetic dissection of complex traits. Science 1994;265:2037-2048.
  • Long AD, Langley CH. The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res 1999;9:720-731.
  • Olson JM, Witte JS, Elston R. Genetic mapping of complex traits. Stat Med 1999;18:2961-2981.
  • Palmer LJ, Cookson WOCM. Using Single Nucleotide Polymorphisms (SNPs) as a means to understanding the pathophysiology of asthma. Respiratory Research 2001;2:102-112.
  • Risch NJ. Searching for genetic determinants in the new millennium. Nature 2000;405:847-856.
  • Slatkin M, Excoffier L. Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm. Heredity 1996;76:377-383.
  • Terwilliger JD, Goring HH. Gene mapping in the 20th and 21st centuries: statistical methods, data analysis, and experimental design. Hum Biol 2000;72:63-132.
  • Weeks D, Lathrop G. Polygenic disease: methods for mapping complex disease traits. TIG 1995;11:513-519.