Redefining comparative sequencing using high-density oligonucleotide arrays
Introduction
The first draft of the Human Genome was a monumental event in the history of science, and its completion launched a new era in genetic research. The availability of a draft sequence for all human genes enables scientists to identify genes associated with disease and, in particular, the sequence differences that lead to it. Today, scientists face new challenges as they begin comparing the genomes of diseased and healthy individuals to decipher the genetic variation underlying a particular disorder. As researchers switch their focus from de novo to comparative sequencing experiments, they face the dilemma of how to perform accurate and precise sequence analysis quickly and easily. Comparative sequencing must be highly accurate and often requires the analysis of large amounts of sequence across hundreds of individuals, whereas de novo sequencing is usually performed on just a few individuals. For large-scale studies, and the medical advances they promise, to become a reality, new technologies will be required.
The high-density microarray, traditionally used in gene expression studies, is one technology that scientists are turning to for its ability to deliver both the accuracy and the scalability that large-scale comparative sequencing requires. Over the past several years, Affymetrix GeneChip® technology has played a significant role in gene expression analysis by leveraging photolithographic manufacturing processes to create the high-density microarray. GeneChip arrays have allowed researchers to take whole-genome snapshots, rather than looking at one gene at a time, creating a new experimental paradigm and enabling research that was not previously possible.
Affymetrix recently announced the new CustomSeq array program, which allows scientists to analyze DNA on the same proven GeneChip platform and instrumentation used for RNA analysis. One CustomSeq array sequences up to 30,000 bases of double-stranded DNA in only 48 hours. CustomSeq arrays deliver high-quality, completed sequence with minimal assembly and curation, enabling researchers to perform large-scale comparative sequencing more efficiently while reducing the overall cost per base. By leveraging the same economies of scale that facilitated whole-genome expression analysis, CustomSeq arrays are positioned to move the field of comparative sequencing forward.
Capillary sequencing
Gel-based or capillary sequencing has enabled de novo applications, but the technology is limited in meeting the needs of many large-scale comparative sequencing projects, which demand a higher level of accuracy and reproducibility than de novo sequencing. Additionally, resequencing large stretches of DNA in several hundred samples is too onerous for the average lab using traditional capillary methods, because of the effort involved in manual editing and sequence assembly. This work does not scale well. With traditional capillary sequencing methods, a long strand of DNA is "chopped" into small fragments of about 500 base pairs and then reassembled during data analysis. A 30 kb region would require the amplification, purification, and assembly of approximately 120 sequencing reactions per sample, a challenging prospect for many laboratories. For these reasons, many projects of interest to scientists have sat on the shelf, awaiting advances in technology that will make them cost-effective. CustomSeq's capacity for large-scale sequencing will enable researchers to pursue many applications that were impractical using older sequencing methods. These new applications include:
Genetic mapping experiments
Geneticists have relied on linkage analysis in family populations to identify genes associated with a particular disease or phenotype. This methodology uses a low-resolution panel of genetic markers to track the passage of chromosomal segments from one generation to the next, providing clues that link the inheritance of a particular genetic region to the inheritance of a particular trait. Typically, sequential "mapping" experiments are required, using marker panels of increasing density to narrow the region as much as possible. This region can then be cloned and resequenced across multiple individuals to locate the disease-associated SNP. Over the years, many linkage analysis experiments have successfully identified chromosomal regions associated with a particular disease. Yet in many cases, the disease-causing mutation remains unknown because of the large amount of work involved in resequencing those regions.
As high-density SNP assays become available, scientists will, for the first time, apply whole-genome association studies and haplotype analysis to "map" regions in unrelated individuals. As in linkage mapping, scientists would like to follow up by resequencing multiple individuals to identify the mutations associated with the disease. With traditional sequencing methods, nucleotide analysis of these chromosomal regions is often considered too much work, even when only a few candidate genes have been identified within them. CustomSeq arrays deliver 30,000 bases of double-stranded sequence in a single hybridization, facilitating large-scale resequencing of the large regions identified in mapping experiments.
Candidate gene analysis
Another key application for comparative sequencing is candidate gene analysis. These studies begin with the hypothesis that a specific gene or set of genes may be involved in a particular disease. Genes are typically selected based on evidence from previous whole-genome mapping or expression experiments. Today, candidate gene studies sometimes involve complete sequence analysis but are often limited to genotyping a small set of known mutations in a particular gene or set of genes. When the analysis is limited to known SNPs, important sequence variations can be missed because the SNP is rare or previously unknown. Using CustomSeq arrays, scientists can broaden the experimental scope to analyze complete sequence information for multiple genes in a single experiment, giving a complete view of the sequence variation and increasing the confidence of identifying disease-associated SNPs.
Pharmacogenetic studies
Pharmacogenetic studies represent a new application for comparative sequencing that is of growing interest to biopharmaceutical companies. Understanding the genetic variation within a clinical population may provide valuable insight into differences in therapeutic efficacy or toxicity. Biopharmaceutical companies have begun analyzing genetic variation in the target and pathway genes in order to characterize the genetic basis for differential responses to specific drugs. For some drug targets, complete routine sequence analysis is desirable due to the frequency of spontaneous mutations or variety of known mutations in that particular disease. In these studies, CustomSeq arrays facilitate the comparison of data across populations by providing accurate and reproducible data.
Pathogen subtyping
Typically, pathogens are detected using an immunoassay, followed by in-depth sequence analysis to determine the particular strain or subtype. Because of the high rate of mutagenesis and the complexity of strain variation, mutation analysis of microorganisms requires technology that both genotypes known mutations and identifies novel SNPs. CustomSeq arrays allow researchers to analyze whole sequence and identify known and new mutants that may contribute to the pathogenicity or therapeutic resistance of an organism.
Array-based sequencing is not a new concept; the evolution of this technology is chronicled in publications dating back to 1993.(1-12) GeneChip arrays have been successfully applied to understanding the impact of p53 mutations in lung and bladder cancers (3-6) and to dissecting the role of genetic variation in the cytochrome P450 enzymes as it relates to the metabolism of antidepressants.(7) Recently, sequencing methodology and throughput have begun advancing to a performance level adequate for the applications described above. GeneChip technology has progressed through advances in array manufacturing and assay protocols to provide a robust, reproducible and highly accurate method for sequencing DNA. The best example of the state of GeneChip technology is the recent work of Perlegen Sciences, Inc., an Affymetrix affiliate. Using whole, undiced wafers, scientists at Perlegen resequenced 50 human genomes in just 15 months, a feat inconceivable with capillary sequencing (reference the Perlegen paper in Science or Nature). The CustomSeq array program gives researchers access to this powerful technology, bringing large-scale comparative sequencing to their benchtop.
Using CustomSeq
To resequence DNA, CustomSeq arrays use the same proven principle of allele-specific hybridization that has made Affymetrix the industry standard in gene expression. For each base in a sequence, Affymetrix tiles four probes, one for each of the four possible nucleotides: A, C, G and T. The sample is labeled with fluorescent molecules, and the feature that "lights up" when the sample is hybridized to the CustomSeq array indicates the base present at that position. This technology allows both complete sequence and complete genotype analysis in a single experiment. If there is no SNP at the position, only one of the four features will light up. If there is a heterozygous SNP at the position, two of the four features will light up. Both known and novel SNPs can be detected unambiguously. (Figure 1.)
Researchers choose exactly what sequence they want to put on the array. Any haploid or diploid organism may be tiled, and the sequence can represent a combination of contiguous genomic regions or multiple dispersed sets of exons. Once the sequence is selected and submitted to Affymetrix in the correct format, arrays are shipped to customers in less than eight weeks.
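The four-probe readout described above can be illustrated with a small sketch. It is not the algorithm used by the Affymetrix software; the function names, the intensity values and the `HET_RATIO` threshold are all hypothetical, chosen only to show how one bright feature yields a single base call and two bright features yield a heterozygous call.

```python
from typing import Dict, List

# Hypothetical threshold (an assumption, not an Affymetrix parameter):
# a second probe counts as "lit" when its signal reaches this fraction
# of the brightest probe at the same position.
HET_RATIO = 0.5


def call_base(intensities: Dict[str, float], het_ratio: float = HET_RATIO) -> str:
    """Call one position from the intensities of its four tiled probes.

    Returns a single base (e.g. "A") when one feature dominates, a
    two-base genotype (e.g. "A/G") when two features light up, or "N"
    when there is no usable signal.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_signal), (second_base, second_signal) = ranked[0], ranked[1]
    if top_signal <= 0:
        return "N"
    if second_signal >= het_ratio * top_signal:
        # Two features are bright: report a heterozygous genotype.
        return "/".join(sorted((top_base, second_base)))
    return top_base


def call_sequence(positions: List[Dict[str, float]]) -> List[str]:
    """Call every tiled position in order."""
    return [call_base(p) for p in positions]


if __name__ == "__main__":
    # Made-up intensities for two positions, for illustration only.
    example = [
        {"A": 950.0, "C": 40.0, "G": 35.0, "T": 50.0},   # one feature lit
        {"A": 480.0, "C": 30.0, "G": 510.0, "T": 25.0},  # two features lit
    ]
    print(call_sequence(example))  # ['A', 'A/G']
```

A fixed ratio keeps the sketch short; a production caller would more plausibly use a statistical model and the signal from both strands.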
Figure 2 describes the recommended CustomSeq Array protocol. The assay begins with amplification of genomic DNA using standard PCR strategies. Resulting amplicons are then pooled in equimolar concentrations and fragmented prior to end labeling with biotin. Next, the labeled DNA is hybridized to the array and scanned using an Affymetrix scanner. Data is analyzed using GeneChip DNA Analysis Software (GDAS).
Benefits
Efficiency: time to results
The most dramatic difference between array and capillary sequencing is the length of time it takes to produce a completed sequence. CustomSeq arrays facilitate data analysis by providing "read lengths" of up to 30 kilobases, minimizing the amount of alignment and assembly required and significantly reducing the time associated with this tedious step. Additionally, there is little ambiguity in CustomSeq's results, as the Affymetrix software automatically calls sequence and genotypes, including heterozygotes. The alignment and manual editing required with capillary sequencing can take weeks to months to complete, depending on the quality of the data. Using a CustomSeq array, one can obtain complete sequence information for 30 kilobases of double-stranded DNA (60 kilobases in total) in just 48 hours. Furthermore, array-based sequencing is only at the beginning of the technology curve. Shah et al.(13) present proof-of-principle experiments showing that feature size can be reduced to 18 microns (the same feature size used on Affymetrix's latest gene expression arrays) without reduction in assay performance, increasing the capacity of a similarly sized array to 60 kb of double-stranded sequence (120 kb total). Continued improvements in arrays, such as reductions in feature size and improved manufacturing efficiencies, will further widen the gap between array and capillary sequencing.
Accurate and reproducible results
CustomSeq arrays provide automatic base calls at >99.99% accuracy, setting a new standard for large-scale comparative sequencing.(11) Several recent publications demonstrate the ability to achieve highly accurate and reproducible results using allele-specific hybridization.(11-13) CustomSeq arrays sequence both strands of DNA simultaneously, generating data at a higher confidence level than that obtained by sequencing a single strand. Furthermore, Affymetrix's parallel manufacturing method and stringent quality control ensure >99.99% reproducibility between arrays. Dideoxy sequencing is subject to far greater variability because of differences in user expertise, chemicals and purity of the gel matrix.
Cost effective
Detailed analysis of array-based sequencing costs shows that high-quality data can be achieved for less than a penny per base. Dideoxy sequencing costs vary greatly between institutions and depend largely on the throughput of an individual laboratory. Sequencing reagent costs have fallen significantly over the past few years, but the costs of PCR amplification and purification, and the labor involved in sequence assembly, remain substantial for comparative analysis of large genomic regions.
Implementing long-range amplification strategies further improves the cost effectiveness of arrays by reducing the number of PCR reactions needed to amplify tens of kilobases of genomic sequence. Warrington et al.(12) implemented long-range amplification to sequence more than 200 kilobases in 40 individuals, for a total of 8.4 megabases. This publication also describes automation strategies that increase the throughput of the GeneChip system to 1.4 megabases per day per two technicians.
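The figures quoted from Warrington et al. can be checked with a few lines of arithmetic. The sketch below is only an illustration: the per-individual value of 210 kb is an assumption (the text says "more than 200 kilobases"), chosen because it reproduces the stated 8.4 megabase total, and the function name is hypothetical.

```python
import math

# Assumption: "more than 200 kilobases" is taken as 210 kb per individual,
# the value that reproduces the 8.4 Mb total quoted in the text.
BASES_PER_INDIVIDUAL = 210_000
INDIVIDUALS = 40
# From the text: 1.4 megabases per day per two technicians.
THROUGHPUT_BASES_PER_DAY = 1_400_000


def study_size_and_duration(bases_per_individual: int,
                            individuals: int,
                            throughput_per_day: int) -> tuple:
    """Return (total bases sequenced, whole days needed at the given throughput)."""
    total_bases = bases_per_individual * individuals
    days = math.ceil(total_bases / throughput_per_day)
    return total_bases, days


if __name__ == "__main__":
    total, days = study_size_and_duration(
        BASES_PER_INDIVIDUAL, INDIVIDUALS, THROUGHPUT_BASES_PER_DAY
    )
    print(f"{total / 1e6:.1f} Mb in about {days} days")  # 8.4 Mb in about 6 days
```

Under these assumptions, a study of that size corresponds to roughly six working days at the automated throughput, ignoring setup and PCR time.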
The larger the study, the more efficient array-based sequencing becomes. Resequencing longer contiguous regions requires far fewer PCRs and far less assembly and curation compared to capillary sequencing methods. Instead of spending time cleaning and fixing data, researchers can move on to finding meaning in their results.
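The scaling argument can be made concrete with a rough workload comparison. This is a sketch under stated assumptions: reads of about 500 bp and an array capacity of 30 kb come from the text, while sequencing each fragment on both strands and ignoring overlap between reads are assumptions made here so that a 30 kb region gives the roughly 120 reactions per sample mentioned earlier. The function names are hypothetical.

```python
import math

CAPILLARY_READ_BP = 500      # from the text: fragments of about 500 base pairs
ARRAY_CAPACITY_BP = 30_000   # from the text: 30,000 bases per CustomSeq array
STRANDS = 2                  # assumption: each fragment is read on both strands


def capillary_reactions(region_bp: int, samples: int,
                        read_bp: int = CAPILLARY_READ_BP,
                        strands: int = STRANDS) -> int:
    """Sequencing reactions needed, assuming non-overlapping reads."""
    return math.ceil(region_bp / read_bp) * strands * samples


def arrays_needed(region_bp: int, samples: int,
                  capacity_bp: int = ARRAY_CAPACITY_BP) -> int:
    """Arrays needed when each array covers up to capacity_bp of the region."""
    return math.ceil(region_bp / capacity_bp) * samples


if __name__ == "__main__":
    for samples in (1, 100, 300):
        reactions = capillary_reactions(30_000, samples)
        arrays = arrays_needed(30_000, samples)
        print(f"{samples:>3} samples: {reactions:>6} capillary reactions "
              f"vs {arrays:>3} arrays")
```

For a 30 kb region the ratio stays at 120 reactions per array regardless of sample count, so the absolute difference in handling and assembly work grows linearly with the size of the study.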
Conclusions
CustomSeq arrays are redefining large-scale comparative sequencing by giving researchers more information that is more reliable and more cost-effective to obtain. Just as Affymetrix GeneChip technology revolutionized large-scale gene expression analysis, CustomSeq arrays are setting a new standard in resequencing. By providing researchers with an efficient tool for comparative sequencing, CustomSeq arrays will facilitate the discovery of mutations involved in humanity's most prevalent diseases.
About the author
Anna Berdine, the Product Manager for DNA analysis at Affymetrix Inc., received her Masters in Biotechnology from Northwestern University and a Bachelors in Biochemical Pharmacology from the State University of New York at Buffalo.
More information is available from: