
SNP discovery by DNA sequencing begins with genomic DNA samples from a set of human subjects. These samples are quantitated, standardized for concentration, arrayed into 96 well storage plates, and replicated to 96 well PCR plates. PCR reagents are then added to the PCR plates containing genomic DNA, after which the plates are thermal cycled. The PCR reactions are then cleaned and serve as the template for cycle sequencing reactions; each PCR reaction is sequenced in both the forward and the reverse direction in separate reactions. The sequence data are then analyzed to identify polymorphisms and to assign genotypes to individual DNA samples.

Genomic DNA Sample Tracking

One area of QC deals with genomic DNA sample mix-up. Several layers of controls are built into genomic DNA handling. Genomic DNA samples, held in individual microfuge tubes, are ordered into a tube holder laid out in the same 8 row X 12 column array as the PCR reaction plates. This process is performed by two technicians, with each step checked off by both technicians. This process is done only once; the tubes are left in place in the holder until the DNA supply is exhausted, thus minimizing errors in genomic DNA placement and allowing any mistakes to be easily traced back to the source array. Within this array of genomic DNA samples, at least 2 asymmetrically placed blank tubes containing water are included. These water blanks serve as "sentinels" for cross contamination of DNA, as well as indicators of plates that have been inadvertently rotated 180°, such that the top left becomes the bottom right. Data from these "sentinel wells" are examined after each sequencing run to ensure that they are indeed blank. Both the movement of genomic DNA from individual tubes to the 96 well storage plates and the movement of genomic DNA from the 96 well storage plates to the 96 well PCR reaction plates are performed by an automated liquid handling robot. In each case, the addressing of the array that was initially set up using individual tubes of genomic DNA is maintained. Thus there is a negligible chance of individual samples being switched, and the identity of a single well of a 96 well PCR reaction plate is easily traceable back to the original tube of genomic DNA. This addressing integrity is maintained in later steps as well. Following PCR, the PCR cleanup, cycle sequencing setup, cycle sequencing cleanup, and sequencing electrophoresis are performed robotically in 96 well plates, maintaining the original addressing scheme.
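
The reason the water blanks must be placed asymmetrically can be illustrated with a short sketch. It assumes wells are named A1 through H12 (rows A-H, columns 1-12); that naming and all function names here are illustrative assumptions, not part of the laboratory's software. A 180 degree rotation maps each well to its diagonal opposite, so a blank layout reveals the rotation only if it does not map onto itself.

```python
ROWS = "ABCDEFGH"   # 8 rows (assumed naming convention)
N_COLS = 12         # 12 columns


def rotate_180(well):
    """Return the position a well occupies after the plate is rotated 180 degrees."""
    row, col = well[0], int(well[1:])
    new_row = ROWS[len(ROWS) - 1 - ROWS.index(row)]
    new_col = N_COLS + 1 - col
    return f"{new_row}{new_col}"


def detects_rotation(blank_wells):
    """True if the blank layout changes under a 180 degree rotation."""
    blanks = set(blank_wells)
    return {rotate_180(w) for w in blanks} != blanks


def contaminated_sentinels(blank_wells, wells_with_signal):
    """Return sentinel wells that unexpectedly show sequence signal."""
    return sorted(set(blank_wells) & set(wells_with_signal))
```

For example, blanks at A1 and H12 are symmetric (each rotates onto the other) and would not reveal a rotated plate, whereas blanks at A1 and C5 would, because after rotation the blank signal would appear at H12 and F8 instead.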

Sequence, Polymorphism, and Data Tracking

At the point of electrophoresis, individual samples are "reunited" with their identity in the form of a sample sheet. The sample sheet is an entity within the sequencer control software which assigns a specific chromatogram file name to a specific well position of a 96 well plate to be electrophoresed. This identity is predefined by software which automatically generates the sample sheets based on the sample addressing scheme of the original array of genomic DNA tubes.

Sequence chromatogram files generated by the sample sheet are named according to a very specific format. The first 5 characters identify the gene from which the PCR reactions were designed. The next three characters identify the particular amplicon within that gene, and whether the forward or reverse sequencing primer was used in that particular reaction. The next 4 characters contain the human subject identification of the DNA sample for that chromatogram.
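
As a rough illustration of this fixed-width naming scheme, the sketch below splits a file name into its three fields by position. The example name `ABCD101FS123.ab1`, the file extension, and the function and class names are hypothetical; only the 5 / 3 / 4 character layout comes from the description above.

```python
import re
from typing import NamedTuple


class ChromatogramName(NamedTuple):
    gene: str                # first 5 characters
    amplicon_direction: str  # next 3 characters (amplicon plus forward/reverse)
    subject: str             # next 4 characters (human subject identifier)


# Fixed-width fields: 5 + 3 + 4 characters at the start of the name.
NAME_PATTERN = re.compile(r"^(?P<gene>.{5})(?P<amp>.{3})(?P<subject>.{4})")


def parse_chromatogram_name(filename):
    """Split a chromatogram file name into gene, amplicon/direction, and subject."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name too short for the naming format: {filename!r}")
    return ChromatogramName(match["gene"], match["amp"], match["subject"])
```

With the hypothetical name above, `parse_chromatogram_name("ABCD101FS123.ab1")` yields the fields `ABCD1`, `01F`, and `S123`.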

At the conclusion of a sequencing electrophoresis run, automated scripts move the chromatogram files from the sequencing machines' computers to a central server. These scripts examine the specific file names in order to place the files in the appropriate directory for a particular gene and amplicon.
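
A minimal sketch of such routing is shown below. It only computes the destination path and does not move anything. The server root `/data/chromatograms`, the gene/amplicon directory layout, and the assumption that the first two characters of the three-character field identify the amplicon are all hypothetical choices made for illustration; the actual scripts may differ.

```python
from pathlib import PurePosixPath


def destination_for(filename, root="/data/chromatograms"):
    """Compute a gene/amplicon directory path from a chromatogram file name."""
    name = PurePosixPath(filename).name
    if len(name) < 12:
        raise ValueError(f"file name too short for the naming format: {name!r}")
    gene = name[:5]        # first 5 characters identify the gene
    amplicon = name[5:7]   # ASSUMPTION: first 2 of the next 3 characters
    return PurePosixPath(root) / gene / amplicon / name
```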

Sequence and Genotype Quality Control

Once on the server, the chromatogram files are processed by Phred and Phrap to base-call and assemble the sequences. Assemblies are processed with PolyPhred to identify potential polymorphism sites and assign genotypes automatically. A minimum quality value of 20 is used in PolyPhred. Potential polymorphisms are visually inspected to confirm or reject them. Criteria for confirmation are:

  1. A subjective determination of a valid heterozygote peak, including minimal noise, a reduction in a consensus peak height coupled with the appearance of a non-consensus peak AND

  2. Confirmation of a heterozygote at that site in a second file, either the same amplicon in the opposite direction, the same site in an independent overlapping amplicon, or the same heterozygote at the same site in a different human subject.

Genotypes for all samples are visually inspected and confirmed. In cases where two or more sequence chromatograms are in conflict as to a particular genotype for a particular human subject, the analyst will resolve the conflict by applying an overriding tag to the incorrect genotype only if the reason for the incorrect genotype is obvious, for example a small noise peak under a true homozygous peak being incorrectly identified as a heterozygote. If there is no obvious explanation for the conflicting genotypes then all data for that subject at that site is removed from the analysis.
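
The conflict rule above can be sketched as a small function. The representation of a call as a `(genotype, overridden)` pair and the function name are assumptions for illustration; the override flag stands in for the analyst's overriding tag.

```python
def resolve_genotype(calls):
    """Resolve one subject's genotype at one site from several chromatograms.

    `calls` is a list of (genotype, overridden) pairs, where `overridden`
    is True when the analyst has tagged that call as incorrect.
    Returns the agreed genotype, or None if the data should be dropped.
    """
    kept = {genotype for genotype, overridden in calls if not overridden}
    if len(kept) == 1:
        return kept.pop()
    # No usable call, or an unexplained conflict: remove from the analysis.
    return None
```

A conflict such as `[("A/A", False), ("A/G", True)]` resolves to `A/A`, while the same pair with no override returns `None`.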

Our criteria for considering a gene to be complete are:

  1. >90% of all possible genotypes have been assigned. (all possible genotypes = number of polymorphisms X number of subjects.) AND

  2. No individual polymorphism has < 80% of the genotypes assigned.
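
The two completeness criteria can be checked as follows. This is a sketch under the assumption that the number of assigned genotypes per polymorphism is available as a dictionary; the function and parameter names are hypothetical. Integer arithmetic is used so the 90% and 80% boundaries are compared exactly.

```python
def gene_is_complete(assigned, n_subjects, overall_pct=90, per_site_pct=80):
    """Apply both completeness criteria to a gene.

    `assigned` maps each polymorphism to the number of subjects with an
    assigned genotype at that site.
    """
    if not assigned or n_subjects <= 0:
        return False
    total_possible = len(assigned) * n_subjects  # polymorphisms X subjects
    total_assigned = sum(assigned.values())
    # Criterion 1: strictly more than 90% of all possible genotypes assigned.
    overall_ok = total_assigned * 100 > overall_pct * total_possible
    # Criterion 2: no polymorphism has fewer than 80% of genotypes assigned.
    per_site_ok = all(n * 100 >= per_site_pct * n_subjects
                      for n in assigned.values())
    return overall_ok and per_site_ok
```

Note that a gene can pass the first criterion and still fail the second: with 100 subjects and counts of 100, 100, 100, and 79, the overall rate is 379/400 (above 90%), but the last site falls below 80%.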

Final genotype data are output from PolyPhred to a PostgreSQL database via a series of Perl scripts. Correct assignment of the genotypes to the human subject is accomplished because of the strictly formatted chromatogram file name structure. The software scripts use regular expression parsing to associate a genotype with the subject identifier, which always occurs at the same position in the file name.

The quality control of genotype accuracy is a function of 1) the quality threshold (20) of PolyPhred, which filters out poor sequence quality, 2) the inherent accuracy of DNA sequencing and 3) sequencing coverage redundancy. Because of the redundancy inherent in sequencing each amplicon in both directions, as well as the occurrence of overlapping amplicons, the majority of genotypes have confirmation in an additional reaction.
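
To give a sense of what the threshold means, and assuming the value of 20 refers to the standard Phred quality score Q = -10 log10(P), where P is the estimated probability that a base call is wrong, the conversion can be sketched as below. The function name is illustrative.

```python
def phred_error_probability(quality):
    """Convert a Phred quality score to the estimated base-call error probability."""
    return 10 ** (-quality / 10)
```

Under that assumption, a quality of 20 corresponds to an estimated error probability of about 1 in 100 for an individual base call, which is why confirmation in a second reaction adds meaningful protection.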