To my family

List of Publications

This thesis is based on the following publications, which are referred to in the text by their Roman numerals.

I Lovmar L, Ahlford A, Jonsson M, Syvänen A-C. “Silhouette” scores for assessment of SNP genotype clusters. BMC Genomics. (2005) 6:35.

II (a) Chen D*, Ahlford A*, Schnorrer F*, Kalchhauser I, Fellner M, Viràgh E, Kiss I, Syvänen A-C, Dickson BJ. High-resolution, high-throughput SNP mapping in Drosophila melanogaster. Nature Methods. (2008) 5(4),323-9.
   (b) Schnorrer F*, Ahlford A*, Chen D*, Milani L, Syvänen A-C. Positional cloning by fast-track SNP-mapping in Drosophila melanogaster. Nature Protocols. (2008) 3(11),1751-65.

III Ahlford A, Kjeldsen B, Reimers J, Lundmark A, Wolff A, Romani M, Syvänen A-C, Brivio M. Dried reagents for multiplex genotyping by tag-array minisequencing to be used in microfluidic devices. Analyst. (2010) Jul 29. (Published online ahead of print)

IV Kiialainen A, Karlberg O, Ahlford A, Sigurdsson S, Lindblad-Toh K, Syvänen A-C. Performance of microarray and liquid based capture methods for target enrichment for massively parallel sequencing and SNP discovery. Submitted manuscript.

* These authors contributed equally to this work

Reprints were made with permission from the respective publishers.

Additional publication by the author

Li JB, Gao Y, Aach J, Zhang K, Kryukov GV, Xie B, Ahlford A, Yoon JK, Rosenbaum AM, Zaranek AW, LeProust E, Sunyaev SR, Church GM. Multiplex padlock targeted sequencing reveals human hypermutable CpG variations. Genome Research. (2009) 19(9),1606-15.

Supervisors
Prof. Ann-Christine Syvänen, Department of Medical Sciences, Uppsala University, Uppsala, Sweden
Associate Prof. Mats Nilsson, Department of Genetics and Pathology, Uppsala University, Uppsala, Sweden

Chair
Prof. Eva Tiensuu Janson, Department of Medical Sciences, Uppsala University, Uppsala, Sweden

Faculty opponent
Prof. David Schwartz, Department of Chemistry, Laboratory of Genetics, University of Wisconsin-Madison, WI, USA

Review board
Prof. Pål Nyrén, School of Biotechnology, Albanova, KTH (Royal Institute of Technology), Stockholm, Sweden
Associate Prof. Mattias Mannervik, Wenner-Gren Institute, Developmental Biology, Stockholm University, Stockholm, Sweden
Prof. Lars Engstrand, Swedish Institute for Infectious Disease Control and Department of Microbiology, Cell and Tumor Biology, Karolinska Institutet, Stockholm, Sweden

Contents

Introduction
  The chemical structure of DNA
  Sequence variation
  Analysis of sequence variation
    Polymorphisms as genetic markers
    Gene mapping in model organisms
  Resources for sequence analysis
  Methods for variant detection
    Complexity reduction
    Reaction principles
    Assay formats
    Coding and decoding
    Data analysis and quality control
Aim of this thesis
The present study
  Tag-array minisequencing
    Reaction steps and assay set-up
  Silhouette scores for quality assessment of genotype data
    Silhouette scores
    Enzyme performance
  Tools for gene mapping in Drosophila melanogaster
    SNP map and FLYSNPdb
    Genome-wide genotyping assays for gene mapping
    SNPmapper for simplified data analysis
    Resources for fine-mapping
    Mapping of genes underlying muscle development
  Tag-array minisequencing in a lab-on-a-chip
    Freeze-drying of reagents
    Systematic genotyping comparison
    Genotyping in micro-fabricated structures
  Sample complexity reduction for re-sequencing candidate genes
    Study design
    Capture quality
    Variant calling
Concluding remarks
Acknowledgements
References

Abbreviations

bp       Base pair
CNV      Copy number variant
dNTP     Deoxynucleotide triphosphates
ddNTP    Dideoxynucleotide triphosphates
ExoI     Exonuclease I
GA       Genome Analyzer
InDel    Insertion-deletion polymorphism
kb       Kilobase pair
Mb       Megabase pair
MIP      Molecular inversion probe
MPS      Massively parallel sequencing
nt       Nucleotide
PCR      Polymerase chain reaction
RFLP     Restriction fragment length polymorphisms
SAP      Shrimp alkaline phosphatase
SBE      Single base primer extension
SBS      Sequencing by synthesis
SNP      Single nucleotide polymorphism
µ-TAS    Micro-total analysis systems

Introduction

Advances in biomedical research have over the last decades been driven by the rapid development of technology. Dideoxy “Sanger” sequencing (Sanger et al. 1977) for the determination of the base order of DNA molecules, together with polymerase chain reaction (PCR) (Mullis et al. 1986) for targeted amplification of DNA, are the two most important examples. These tools have revolutionized molecular biology and made it possible to determine the complete sequence of the human genome, of which the first draft celebrates its 10-year anniversary this year (Lander et al. 2001; Venter et al. 2001). The trend has continued through the “post-genomic” era with the development of large-scale genotyping techniques for the study of genetic sequence variation (Syvänen 2005). More recently, principles for massively parallel sequencing (MPS) have followed (Mardis 2008; Shendure and Ji 2008; Metzker 2010). These tools enable extensive genetic research. In addition, hopes are high that the technology may have an impact on clinical genetics and its transition into healthcare.
This thesis addresses some of the features related to the analysis of sequence variation in the genetic model organism Drosophila melanogaster and in humans.

The chemical structure of DNA

The deoxyribonucleic acid (DNA) molecule contains the code for the inherited information of all living organisms (Avery et al. 1944). Commonly, DNA consists of two long polymers made up of four different types of units called nucleotides. Each nucleotide is composed of the sugar molecule deoxyribose, a phosphate group and one of the four nitrogenous bases: adenine, thymine, cytosine and guanine (Levene 1919; Astbury 1947). The nucleotides are ordered in a linear sequence in a predetermined way. The two strands are complementary to each other and twined together to form a double helix in which the bases pair in a specific manner (Franklin and Gosling 1953; Watson and Crick 1953). In each cell, these large molecules are organized into structures called chromosomes. In the fruit fly Drosophila melanogaster the genetic information is encoded in ~180 million base pairs (bp) (Celniker and Rubin 2003) packed into four pairs of chromosomes: the X/Y sex chromosomes and the autosomes 2, 3 and 4. In humans the complete genetic code, the genome, is about 20 times bigger (Lander et al. 2001) and organized in 23 pairs of chromosomes (Tjio and Levan 1956). The portions of DNA that contain the genetic information are called genes. These are read and transcribed from DNA to RNA, which in turn can be translated into amino acids to form proteins with different functions, according to the central dogma of molecular biology stated by Francis Crick (Crick 1970). Surprisingly, the number of protein-coding genes in more complex organisms, like humans, is only slightly higher than in the fruit fly. It is therefore suggested that the complexity lies in the plasticity and regulation of expression of the human genome.
Many studies have shown that the large fraction of non-protein-coding DNA sequences has important biological functions (Mattick et al. 2010). Comparisons of DNA sequences from a great number of different organisms have revealed the common features of life. Fundamental genetic principles and functions are remarkably conserved throughout evolution, considering the wide spectrum of organisms on earth. This commonality makes it possible to use model organisms for understanding principles that can be applied to more complex systems. Drosophila has been one of the most studied organisms in biological research for almost a century, particularly in genetics and developmental biology (Burdett and van den Heuvel 2004). The fruit fly serves as an important research model and has in many respects contributed knowledge important for human health, such as the understanding of human development and basic functions. About half of the protein-coding genes in Drosophila have human homologues, and 75% of known human disease genes have a homologue in the genetic code of the fruit fly (Fortini et al. 2000; Reiter et al. 2001).

Sequence variation

The increasing amount of genetic sequence information from multiple individuals has facilitated the study of sequence variation both between and within species. It is estimated that the haploid genomes of two human beings differ on average by 0.5-1% (Levy et al. 2007; Pang et al. 2010). Together with environmental factors, these inherited signatures contribute to the phenotype of each individual and are part of what makes each of us unique. All variation in the genome has arisen through mutation events in the DNA sequence, caused either by copying errors during cell division or by exposure to environmental factors such as toxic compounds, radiation or viruses. Variations that arise in the germline are passed on to the offspring, while somatic mutations only affect the individual organism.
Genetic sequence variations range from large chromosomal alterations to substitutions of individual nucleotides. The most abundant form is single nucleotide polymorphisms (SNPs), which are germline mutations that are fixed in the population. They occur approximately every 1,200 bp in the human genome (Sachidanandam et al. 2001) and every 200 bp in the Drosophila genome (Moriyama and Powell 1996). The completely sequenced individual diploid genomes show that each human carries around 3 million SNPs compared to the human genome reference sequence, which constitute 0.1% of the total sequence (Levy et al. 2007; Bentley et al. 2008; Wang et al. 2008; Wheeler et al. 2008). In addition to the single nucleotide variations, approximately half of the human genome consists of repetitive nucleotide sequences (Lander et al. 2001; Sachidanandam et al. 2001; Korbel et al. 2007). Examples of structural alterations are large copy number variants (CNVs), deletions or insertions, and small (1-5 bp) insertion-deletion polymorphisms (InDels). Structural variation in total constitutes a bigger fraction of nucleotides than SNPs (Conrad et al. 2010; Pang et al. 2010), and recent studies suggest that non-SNP variation, in particular small CNVs, also arises more frequently than previously expected (Stankiewicz and Lupski 2010). Short tandem repeats (STRs), or microsatellites, consist of repeated units 2-6 nucleotides in length (Toth et al. 2000). They are often highly polymorphic but less abundant in the genome. SNPs that create or disrupt a restriction enzyme cleavage site are called restriction fragment length polymorphisms (RFLPs) (Schmidtke et al. 1984). Repeated sequences, insertions, deletions, translocations and inversions can also lead to RFLPs. In the studies included in this thesis we have focused on the detection and analysis of small sequence variants including SNPs and InDels.
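The densities quoted above can be cross-checked with simple arithmetic. The following Python sketch is a back-of-the-envelope illustration only. The round human genome size of 3.0 Gb is an assumption made here for convenience (the text states only that the human genome is about 20 times larger than the ~180 Mb Drosophila genome), and the function names are hypothetical.

```python
# Back-of-the-envelope check of the SNP figures quoted in the text.
# ASSUMPTION: a round haploid human genome size of 3.0 Gb (not stated in the text).
HUMAN_GENOME_BP = 3.0e9
# From the text: the Drosophila genome is ~180 million bp.
DROSOPHILA_GENOME_BP = 180e6


def snp_fraction(n_snps: float, genome_bp: float) -> float:
    """Fraction of genome positions that differ from the reference."""
    return n_snps / genome_bp


def expected_snps(genome_bp: float, spacing_bp: float) -> float:
    """Number of SNPs expected if one occurs on average every `spacing_bp` bases."""
    return genome_bp / spacing_bp


if __name__ == "__main__":
    # ~3 million SNPs per individual relative to the reference sequence
    print(f"{snp_fraction(3e6, HUMAN_GENOME_BP):.2%} of the sequence")
    # one SNP every ~1,200 bp in human, every ~200 bp in Drosophila
    print(f"{expected_snps(HUMAN_GENOME_BP, 1200):,.0f} SNPs (human)")
    print(f"{expected_snps(DROSOPHILA_GENOME_BP, 200):,.0f} SNPs (Drosophila)")
```

Under these assumptions, 3 million SNPs in 3.0 Gb correspond to the 0.1% stated in the text, and a spacing of one SNP per 1,200 bp gives a count of the same order of magnitude.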

Analysis of sequence variation

The impact of sequence variation is dependent on its genomic location. Variants occurring in the coding sequence may alter amino acids of proteins, and changes in intragenic regions and in introns can affect the stability of transcripts and the regulation of gene expression. The variants may be neutral, lead to an altered phenotype, or give rise to disease.

Polymorphisms as genetic markers

Sequence variants serve as markers in genetic studies. Variant analysis for identifying disease genes commonly relies on one of two different approaches: linkage studies in family pedigrees, and candidate gene or genome-wide association studies (GWAS) on sets of affected and unaffected control individuals. In linkage studies the co-segregation of a characteristic with a marker is traced based on their physical proximity on a chromosome with limited recombination between them. Chromosomal segments that are shared between affected individuals in a family are identified. The altered cleavage patterns caused by RFLPs were the first type of markers to be used in linkage analysis and enabled the positional cloning of the cystic fibrosis gene in 1989 (Kerem et al. 1989; Riordan et al. 1989). The disease is an effect of a 3-bp deletion (delF508) in the CFTR gene, resulting in a non-functional protein product. RFLPs were later replaced by microsatellites, which are multi-allelic and more informative as genetic markers. Both RFLPs and microsatellites are, however, difficult to analyze in a high-throughput way, and recently SNPs have gained importance in linkage studies. The bi-allelic nature of SNPs makes them less informative but technically easy to analyze at a large scale in an automated way. SNPs also provide high resolution due to their high density in the genome. Linkage studies have been very successful in identifying highly penetrant rare alleles with monogenic patterns.
Most diseases that are inherited in a Mendelian way have now been identified, and many are analyzed in patients in routine diagnostic laboratories today.

Many types of genetic variants are predisposing factors in common multifactorial disorders. Recently SNPs have been widely applied as markers in association studies, which rely on linkage disequilibrium (LD), the non-random association between the loci influencing the disease and the tested marker. GWAS were initiated to identify the high-frequency genetic variants underlying common disease in an unbiased way. These studies have resulted in the detection of genes for many complex disorders such as Crohn’s disease, where more than 30 loci have been identified (Barrett et al. 2008). In addition, these studies have identified novel cellular pathways important for the pathogenesis of different diseases, which can pinpoint the disease mechanisms and suggest new drug targets. Despite some success, common diseases have turned out to be complicated to map and the functional variants difficult to identify. The reason for this is that they are caused by multiple genetic factors with low effect sizes in combination with environmental factors. The causal alleles may also be rare and differ between individuals or among populations. Different approaches are currently discussed and tested in medical genetics to further elucidate the genetic component of complex disease (Bodmer and Bonilla 2008; Manolio et al. 2009; Cirulli and Goldstein 2010). Re-sequencing of interesting regions or whole genomes of patients and individuals from the general population using MPS technology is one promising means. The new sequencing technologies and “paired-end” analysis will also play an important role in precisely mapping and exploring more complex sequence variants in relation to disease and other phenotypes (Stankiewicz and Lupski 2010).
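The linkage disequilibrium on which association studies rely can be quantified from haplotype counts at two bi-allelic loci. The sketch below uses the standard population-genetic definitions D = p_AB - p_A·p_B and r² = D² / (p_A(1 - p_A)·p_B(1 - p_B)); these formulas, the allele labels A/a and B/b, and the function name are not taken from the thesis and are given only as an illustration.

```python
from typing import Tuple


def linkage_disequilibrium(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Tuple[float, float]:
    """Return (D, r_squared) for two bi-allelic loci from haplotype counts.

    n_AB, n_Ab, n_aB, n_ab are the observed counts of the four possible
    haplotypes (hypothetical labels: alleles A/a at locus 1, B/b at locus 2).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes observed")
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A
    p_B = (n_AB + n_aB) / total    # frequency of allele B
    d = p_AB - p_A * p_B           # deviation from random association
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


if __name__ == "__main__":
    # Alleles always inherited together: complete LD
    print(linkage_disequilibrium(50, 0, 0, 50))
    # All four haplotypes equally common: no LD
    print(linkage_disequilibrium(25, 25, 25, 25))
```

A marker with r² close to 1 relative to a causal variant carries almost the same information as the variant itself, which is why a tested SNP can reveal a nearby disease locus without being functional.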
Complex variants often lead to severe pathogenic phenotypes; large chromosomal aberrations are, for example, commonly found in tumour cells. Many studies have shown that CNVs give rise to different mental disorders like schizophrenia (Need et al. 2009; Tam et al. 2009), but that they are also frequent in healthy individuals (Sebat et al. 2004; Korbel et al. 2007). Additionally, SNPs are important as markers in population genetics, evolutionary studies, and in pharmacogenetics (Ge et al. 2009). RFLPs and microsatellites have also been frequently used as markers for gene mapping and in forensic investigations (Moretti et al. 2001).

Gene mapping in model organisms

Forward genetic screens are performed by selecting individuals who possess a phenotype of interest and subsequently identifying the unknown underlying gene (Figure 1) (St Johnston 2002). In model organisms such as Drosophila melanogaster, mutations can be induced in individuals by exposure to a mutagen, such as a chemical or radiation. Mutants that harbour a desired phenotype are then isolated. Other mutagens, such as random DNA insertions by transformation or active transposons, can also be used to generate new mutants. An advantage of these techniques is that the new alleles are tagged by a known molecular marker that can facilitate the identification of the gene. The mutated gene is next located on its chromosome, first by crossbreeding the mutants with individuals carrying other traits with known locations, and then by estimating how frequently the traits and the mutant phenotype are inherited together in recombinant individuals. Large-scale genetic screens in the multi-cellular model organism Drosophila are powerful for gaining knowledge about cellular and developmental processes. The possibility of performing clonal screens enables the discovery of mutations affecting a given process, and the identification of phenotypes in practically any cell or stage of development.

Figure 1. In gene mapping the line carrying an induced mutation (indicated by a star) is crossed to a divergent mapping line to generate recombinants. The phenotypes of the recombinants are scored as mutant or wild type, and by combining this information with the SNP genotype the recombination breakpoints are defined and the location of the mutation is identified.

The identification of the mutated gene is usually the rate-limiting step in genetic screens. In classical screens phenotypic traits are used to map the new mutant alleles. Traditional mapping strategies in Drosophila are recombination mapping, deficiency mapping (Parks et al. 2004) or mapping with the help of P-elements (Spradling et al. 1999; Ryder et al. 2004), which mainly rely on genetic and cytological markers that are not easily linked to the molecular map. Moreover, phenotypic traits are neither very frequent nor evenly spread throughout the chromosomes. Large gene regions are “cold-spots” for P-element insertions, which results in a bias for mutagenesis of certain loci. These mapping traits also suffer from variability in the genetic background. With an increasing amount of genomic sequence information available, mutations in model organisms are frequently mapped by using small molecular polymorphisms as mapping traits. Examples are microsatellites, InDels and, most commonly, SNPs. In contrast to the classical mapping techniques mentioned above, SNP mapping offers many advantages, including phenotypic neutrality, defined genetic background, high resolution and even distribution in the DNA sequence between common laboratory strains, and the feasibility to analyze all mutational classes (Martin et al. 2001). In several model organisms, mapping of genes with the help of polymorphisms has thus proven to be the preferred method.
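The logic of combining recombinant phenotypes with SNP genotypes, as outlined in Figure 1, can be sketched in a few lines of Python. This is a simplified illustration and not the SNPmapper tool of Study II: it assumes that each recombinant chromosome is scored at an ordered set of SNP markers as 'M' (allele of the mutagenized line) or 'W' (allele of the mapping line), and that the phenotype directly reports whether the recombinant chromosome carries the mutation. The function name, the data layout and the example coordinates are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[Optional[int], Optional[int]]


def candidate_intervals(
    positions: List[int],
    recombinants: List[Tuple[str, str]],
) -> List[Interval]:
    """Return the inter-marker intervals that can still contain the mutation.

    positions    -- sorted marker coordinates along the chromosome
    recombinants -- (alleles, phenotype) pairs; alleles is a string with one
                    'M' or 'W' per marker, phenotype is 'mutant' or 'wildtype'

    Interval k lies between marker k-1 and marker k; the first and last
    intervals extend to the chromosome ends (reported as None).
    """
    n = len(positions)
    kept: List[Interval] = []
    for k in range(n + 1):
        compatible = True
        for alleles, phenotype in recombinants:
            # Alleles of the markers flanking interval k
            flanks = set()
            if k > 0:
                flanks.add(alleles[k - 1])
            if k < n:
                flanks.add(alleles[k])
            # A mutant recombinant must carry mutant-line DNA at the mutation;
            # a wild-type recombinant must carry mapping-line DNA there.
            # If the flanking markers differ, the breakpoint lies inside the
            # interval, so either origin remains possible.
            required = "M" if phenotype == "mutant" else "W"
            if required not in flanks:
                compatible = False
                break
        if compatible:
            left = positions[k - 1] if k > 0 else None
            right = positions[k] if k < n else None
            kept.append((left, right))
    return kept


if __name__ == "__main__":
    markers = [100, 200, 300, 400, 500]   # hypothetical coordinates (kb)
    recs = [
        ("MMMWW", "mutant"),
        ("MWWWW", "wildtype"),
        ("WWMMM", "mutant"),
        ("MMWWW", "mutant"),
    ]
    print(candidate_intervals(markers, recs))
```

In this toy data set every recombinant excludes part of the chromosome, and only the interval between the markers at 200 and 300 remains compatible with all four. A denser SNP map narrows the remaining interval further, which is the motivation for high-resolution marker sets.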
SNP mapping in Drosophila has relied on traditional genotyping methods like analysis of RFLPs, PCR product-length polymorphism (PLP) assays, and denaturing high performance liquid chromatography (DHPLC) (Berger et al. 2001; Hoskins et al. 2001). These methods are well established and easy to perform with standard equipment. They are, however, hampered by low throughput, and the genotyping procedure has therefore been a bottleneck in mapping studies. Recently there has been an enormous technology development in the genotyping field, and large-scale systems are available for parallel genotyping of large SNP sets at a low cost per genotype (Syvänen 2005), as described later in this thesis. The current systems are, however, available mainly for fixed human applications and often involve huge investments in specialized instrumentation. As stated by Macdonald et al. (2005), it is difficult to identify an available system for efficient, cost-effective genotyping at an intermediate scale outside of human genetics, and there is a need for open-source and off-the-shelf solutions to fill this gap. In Study II we developed a cheap microarray-based genotyping system to facilitate SNP mapping in Drosophila.

In reverse genetic screens, mutations are instead generated in known genes or at known genome positions, and the phenotypic consequence is subsequently studied. With the genome sequence available it has become possible to silence genes using RNA interference (RNAi) in model organisms such as Caenorhabditis elegans (Fire et al. 1998) and Drosophila melanogaster (Pal-Bhadra et al. 1997; Neely et al. 2010).

Resources for sequence analysis

The advances in molecular biology and the large genome projects have resulted in publicly available resources, including sequence databases and analysis tools, for a great number of organisms. Human and Drosophila are among the organisms with the most comprehensive tools available.
The National Center for Biotechnology Information (NCBI) was initiated by the National Institutes of Health (NIH) in 1988, and is today the largest resource for biomedical and genomic information (http://www.ncbi.nlm.nih.gov/). The Ensembl project, a joint effort between EMBL-EBI and the Wellcome Trust Sanger Institute since 2000, establishes genome databases for vertebrates and other eukaryotic species, including automatic sequence annotations (http://www.ensembl.org). A similar resource is provided by the University of California Santa Cruz (UCSC) (http://genome.ucsc.edu). One of the databases associated with the NCBI is the Single Nucleotide Polymorphism database, dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) (Sherry et al. 2001). It serves as a central repository of SNPs and short InDels, currently for 58 different organisms. The majority of the polymorphisms are human, and build 131 (April 2010) comprised 23.7 million variants, of which 14.7 million have been validated. There has been a tremendous increase in the number of variants reported in recent years, primarily due to the increased amount of human sequence data generated by MPS projects.

Through the effort of the international Haplotype mapping (HapMap) project, a catalogue of common human variation (>1%) in individuals from different populations has been established (The International HapMap Consortium 2005; Frazer et al. 2007). The HapMap phase III data was released at the end of May 2010, and in total the resource today comprises data for 1301 individuals from 11 populations (http://hapmap.ncbi.nlm.nih.gov). The goal of the HapMap project has been to identify the chromosomal regions where genetic variants are shared in the different populations, the common haplotype blocks, and to identify so-called "tag" SNPs that uniquely identify each of these blocks.
The 1000 genomes project was initiated in 2008 with the aim of sequencing the complete genomes and gene regions of a large number of individuals using the new MPS technology (http://www.1000genomes.org). The goal is to provide a comprehensive resource on human genetic variation, and to discover and characterize rare single nucleotide alleles that are present at minor allele frequencies around 1% or less.

FlyBase is an extensive bioinformatics database for the genome of the model organism Drosophila melanogaster and related species (http://flybase.org). It includes the highest-quality genome sequences and annotations available for any organism (Ashburner and Bergman 2005; Tweedie et al. 2009). Despite the big toolbox available for researchers in the Drosophila community (Venken and Bellen 2005), it has until recently lacked a large-scale catalogue of small molecular polymorphisms similar to the human dbSNP. Study IIa of this thesis includes an effort to identify sequence variants in common Drosophila strains and to establish a high-resolution variant database (Chen et al. 2008; Chen et al. 2009).

Methods for variant detection

The large-scale genomic efforts mentioned above have provoked a rapid development in technology for the analysis of sequence variation, primarily SNPs. Today there are many genotyping systems available, both commercial and open source, for scoring known SNPs in the whole complexity range from single markers to genome-wide analysis. Figure 2 compares different genotyping methods in terms of sample and marker throughput. Recently there have been significant efforts to establish large-scale genotyping systems where key aspects have been cost efficiency per genotype, automation, and parallelisation of the biochemical reactions and the assay format. As a result, commercial systems for genetic analysis on a very large scale, based on microarray technology, are now available in human genetics (Syvänen 2005).
The largest assay for highly multiplexed SNP genotyping, provided by Illumina (Steemers and Gunderson 2007), allows parallel scoring of 2.5 million SNPs, to be increased to 5 million before long (www.illumina.com). Nevertheless, the genotyping systems available today still suffer from some limitations. With the growing numbers of SNPs identified it is desirable to further lower the cost per genotype to facilitate widespread large-scale genotyping. Additionally, the chips should be extended to comprise not only common SNPs but also rare variants. The largest, genome-wide genotyping approaches are primarily developed for fixed panels of markers to identify disease-predisposing SNPs that are associated with complex diseases in humans (Hirschhorn and Daly 2005), and they require significant investments in equipment as well as high reagent costs. Moreover, the systems are inflexible for adaptation to specialized and novel applications. Therefore, cost-efficient genotyping systems on an intermediate scale are needed. These will be required for replication of GWAS findings, for follow-ups of sequencing studies, as well as for future genetic analysis and diagnostics of previously identified risk variants. Moreover, genotyping in non-human systems needs good open-source solutions at an intermediate scale. Although microarray-based methods have been successfully adapted for the analysis of CNVs (Redon et al. 2006; Craddock et al. 2010), large-scale analysis of more complex structural variants needs to be addressed further by new approaches (Teague et al. 2010).

The existing genotyping platforms all combine different reaction principles for allele discrimination with a number of assay formats and strategies for labelling and for read-out of individual signals, in a modular fashion (Kwok 2001; Syvänen 2001). Below follows a description of these modules and examples of technology.

Figure 2. Comparison of SNP multiplexing levels and sample throughput in common SNP genotyping systems. In the Axiom system Affymetrix has increased the sample throughput of their hybridization-based method.

Complexity reduction

Direct analysis of sequence variants in the complexity of the genome of higher organisms is technically challenging. To achieve sufficient sensitivity and specificity, most techniques for genotyping or sequencing require amplification of the starting material for further analysis, amplification of the read-out signal after processing, or both. Despite the recent improvements in MPS technology, it is costly and time-consuming to sequence the whole genome of complex organisms in a large number of individuals (Summerer 2009; Turner et al. 2009; Mamanova et al. 2010). Additionally, the amount of data generated and the data handling are overwhelming for most laboratories today. Hence, in many MPS studies it is desirable to reduce the complexity of the template to perform targeted analysis on portions of the genome in multiple individuals, and to utilize the sequencing capacity of MPS machines in a more optimal way.

Polymerase chain reaction

Two commonly used approaches for amplification of specific genomic fragments are cloning (Cohen et al. 1973) and amplification by polymerase chain reaction (PCR) (Mullis et al. 1986). PCR is powerful for increasing sensitivity and specificity for downstream assays. With PCR a locus-specific region of the genome is amplified, which increases the number of template copies and reduces the complexity of the DNA to be analysed. The principle requires very low amounts of input material. For over a decade it has been a gold standard for copying of DNA. Recently, long-range PCR has been adapted for targeting smaller regions for MPS (1-100 kb) (Mamanova et al. 2010), and applied to sequence candidate genes in pools of a larger number of individuals (Nejentsev et al. 2009).
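The increase in template copies achieved by PCR can be illustrated with the textbook model N = N0 · (1 + E)^c, where N0 is the number of starting copies, c the number of cycles and E the per-cycle efficiency (1.0 means perfect doubling). This model and the function name below are not taken from the thesis; the sketch ignores the plateau phase and is meant only to show how strongly the yield depends on efficiency.

```python
def pcr_copies(start_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized number of template copies after `cycles` PCR cycles.

    efficiency is the fraction of templates copied per cycle (0.0 to 1.0);
    reagent depletion and the plateau phase are deliberately ignored.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return start_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    perfect = pcr_copies(1, 30, efficiency=1.0)
    reduced = pcr_copies(1, 30, efficiency=0.9)
    print(f"30 cycles at 100% efficiency: {perfect:.3g} copies")
    print(f"30 cycles at  90% efficiency: {reduced:.3g} copies")
    print(f"yield ratio between the two fragments: {perfect / reduced:.1f}x")
```

Because the growth is exponential, a modest difference in per-cycle efficiency between two fragments is magnified over the course of the reaction, which is one reason why amplifying many fragments evenly in the same tube is difficult.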
The increasing numbers of genes, SNPs, and sequences to be analyzed require procedures for parallel amplification. However, two problems with multiplex PCR are uneven amplification efficiency due to sequence-dependent differences between the fragments, and an exponential increase of amplification artefacts as the number of added primer pairs grows. Presently, multiplex PCR reactions are a bottleneck in highly multiplexed genotyping methods, since careful design and optimization of PCR assays at multiplexing levels exceeding 10-20 amplicons is difficult and time-consuming, even with better design software. One way to circumvent the difficulties with multiplex PCR is to physically separate individual DNA molecules before clonally amplifying them. PCR colonies, or “polonies” (Mitra and Church 1999), enable parallel amplification of DNA by separating PCR reactions in a thin polyacrylamide film, and in emulsion PCR (Dressman et al. 2003) clonal amplification of fragments from single DNA molecules is performed in a water-in-oil emulsion. Recently, micro-droplet based PCR was applied for the parallel amplification of 457 fragments (172 kb total genomic size) (Tewhey et al. 2009) for subsequent MPS. It is currently limited to ~4000 primer pairs and a throughput of 8 samples per day. Additionally, it requires dedicated instrumentation and as much as 7.5 µg of input material. One example of an alternative amplification strategy is rolling circle amplification (RCA) and circle-to-circle amplification, presented by Dahl et al. (2004), where a circularized target molecule is repeatedly amplified to generate a long linear molecule containing tandem repeats of the target sequence. Reduction of genome complexity has also proven successful with the amplified fragment length polymorphism (AFLP) approach. In AFLP, genomic DNA is cleaved by restriction digestion followed by PCR amplification with universal linker-primers and fragment size selection.
This approach has been used for SNP genotyping (Vos et al. 1995) and the same principle is employed in the Affymetrix GeneChip system (Kennedy et al. 2003; Matsuzaki et al. 2004). A reversed approach is taken by methods that are based on initial allele detection directly in genomic DNA, followed by PCR with universal primer pairs for signal amplification (Oliphant et al. 2002; Baner et al. 2003). Illumina applies this solution in their GoldenGate assay for multiplexed genotyping in large-scale projects (Fan et al. 2003). For undirected DNA amplification, a number of whole genome amplification (WGA) strategies are available (Lovmar and Syvänen 2006). These are powerful for increasing the total amount of DNA and for securing the source material. Multiple displacement amplification (MDA), the most commonly used approach, is based on isothermal, branched amplification using the DNA polymerase of bacteriophage phi29 and random primers (Dean et al. 2002). In their Infinium II assay, Illumina has implemented an initial MDA step followed by allele discrimination and finally signal amplification with an antibody sandwich strategy (Steemers and Gunderson 2007).

Capture by circularization

Enzymatic reactions and oligonucleotides are employed for target capture by circularization. Padlock probes/molecular inversion probes (MIPs) (Nilsson et al. 1994) are linear single-stranded oligonucleotides consisting of two target-specific sequences flanking a universal linker for amplification. They require a highly specific dual recognition event when hybridized to their target molecule, which enables filling of the “gap” and ligation to generate a circular library molecule. MIPs have been applied for large-scale target capture at Mb scale and MPS for analysis of 10,000 human exons (Porreca et al. 2007), identification of genetic variation in hypermutable CpG regions (Li et al. 2009), methylation profiling (Ball et al.
2009) or to screen predicted RNA editing sites (Li et al. 2009), using array-released oligonucleotide libraries. The principle eliminates the need for additional shotgun library preparation. Since it is solution-based, it is easily scaled up and automated using standard molecular laboratory equipment. The main limitations are the capture uniformity, the upfront cost of probes for large target regions and the difficulty of obtaining large amounts of each probe species. Double-stranded Collector probes (Fredriksson et al. 2007) and Selector probes (Dahl et al. 2005; Dahl et al. 2007) follow a similar reaction principle for target capture. These reactions include an initial multiplex PCR or restriction digestion, after which the probes are used to guide the circularization of the target molecules. They have been applied for sequencing the coding sequence of 10 human cancer genes (Dahl et al. 2007; Fredriksson et al. 2007) but currently only enable intermediate-scale enrichment.

Capture by hybridization

Targeted pull-down reactions based on hybridization of target molecules to oligonucleotide microarrays have been performed for enriching ~5 Mb of exonic regions per reaction for MPS (Albert et al. 2007; Hodges et al. 2007; Okou et al. 2007). This approach, provided by Roche/Nimblegen (http://www.nimblegen.com), requires an initial library preparation step, which is followed by hybridization to target-specific oligonucleotide arrays containing 385,000 probes (Figure 3A) and recovery and amplification of the captured targets. Recently the principle has been further scaled up to facilitate the capture of up to 34 Mb on a single array and applied for targeting the whole human exome for genetic diagnosis of Bartter syndrome (Choi et al. 2009). It is available for exon capture, and can be designed for custom and continuous target regions.
The ability to capture large regions is a strength, while the array-based format requires expensive hardware and becomes laborious for larger numbers of samples. Additionally, it requires a rather high amount of starting material, regardless of the total size of the target region. A similar approach in a fully automated microfluidic format has also been used to capture a 480 kb exome subset of 115 cancer-related genes (Summerer et al. 2010). This format is still only adaptable for targets up to 1 Mb in size but has the advantage of the integrated microfluidic format, which is promising for scaling up and for adaptation to clinical applications. Gnirke et al. presented a hybridization-based capture method in solution for MPS using long (170-mer) RNA probes to capture >15,000 coding exons and four continuous regions, 4.2 Mb in total size (Gnirke et al. 2009) (Figure 3B). Today Agilent Technologies (http://www.home.agilent.com) provides the SureSelect target capture method in solution for MPS, both for exon sequencing and for customized capture. The method involves the same reaction steps as the array capture but overcomes many of the obstacles related to the array format. The reactions are easily parallelized and automated for increased sample throughput and are performed using standard molecular biology instrumentation. Compared to the array format, the hybridization is driven further towards completion by an excess of probe molecules per target molecule, and the reaction time is decreased. The probe cost decreases with the number of samples analysed, making the system suitable for large-scale analysis. For smaller studies, however, the cost per capture experiment is more favourable with the Nimblegen arrays.

Figure 3. In the first step a sequencing library is generated through DNA fragmentation, end-repair and universal adapter ligation.
(A) The Nimblegen sequence capture method uses probes that are synthesized on microarray slides, with a varying probe length (60-90 nt) to yield a uniform melting temperature. Library molecules are hybridized to 385K feature arrays in a custom incubation station for 65 hours, after which unbound fragments are washed away and the enriched portion is eluted. (B) The SureSelect method uses biotinylated RNA capture probes, 120 nucleotides in length, to which the library molecules are hybridized in solution for 24 hours in a regular thermocycler. The reaction is mixed with magnetic, streptavidin-coated beads to capture probe-bound library molecules, followed by washing and elution. In both methods, the capture products are finally amplified by universal PCR by means of the adapter sequences.

Reaction principles

Most biochemical reaction principles used to determine the identity of a given nucleotide in a DNA sequence are based on short oligonucleotide probes or primers. The methods rely either on allele-specific hybridization or on enzyme-assisted allele discrimination. Some of these principles are discussed below and shown in Figure 4.

Allele-specific oligonucleotide hybridization (ASH)

In many approaches the hybridization between complementary DNA strands is employed for allele discrimination. Two allele-specific oligonucleotides are designed to differ at the variable base position and are hybridized to the target DNA under stringent conditions (Wallace et al. 1979). The difference in thermal stability between the perfectly matched and the mismatched probe is used to distinguish between SNP alleles. Amplified target DNA can be hybridized to high-density oligonucleotide arrays containing allele-specific primers for high-throughput genotyping (Pease et al. 1994; Chee et al. 1996). Perlegen Sciences (which closed operations in 2009) developed a system where long-range PCR and array hybridization were combined for multiplex genotyping of ~50,000 SNPs (Hinds et al.
2005), which among other things was used in the HapMap project. The GeneChip system by Affymetrix allows genotyping of up to 1.8 million SNP and CNP markers by array hybridization. The hybridization characteristics of the probes are largely influenced by the sequence surrounding the SNP and by the reaction conditions. This requires careful SNP selection and up to 40 different oligonucleotides per SNP to obtain high specificity. In dynamic allele-specific hybridization (DASH) (Howell et al. 1999) the template-probe duplex stability is instead monitored during a temperature increase. DASH has only been multiplexed for two SNPs per reaction and used in small genotyping projects (Jobs et al. 2003). Hybridization-based approaches are simple and do not require expensive enzymes.

Allele-specific probes for real-time detection

DNA amplification during PCR can be monitored directly in an allele-specific way using double-labelled, allele-specific oligonucleotide probes. The 5’-exonuclease or TaqMan™ assay uses specialized probes and the 5’-3’ nucleolytic activity of Taq DNA polymerase for allele discrimination (Holland et al. 1991; Lee et al. 1993; Livak et al. 1995). The probe contains a fluorescent reporter label at its 5’-end and a quencher at its 3’-end. The signal from the reporter is suppressed when the two molecules are in proximity of each other. During the extension step of PCR the probe is displaced and cleaved by the polymerase, whereby the fluorophore is released and a detectable fluorescence signal is generated. This assay has been frequently used for single-marker genotyping and serves as a standard method. Molecular beacons are another type of double-labelled probe that forms a hairpin structure in solution, leading to quenching of the signal (Tyagi and Kramer 1996; Tyagi et al. 1998). Hybridization of the probe to a PCR product opens the hairpin and separates the probe ends, which gives rise to a signal.
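When two allele-specific probes carry different reporters, the two end-point signals can be turned into a genotype call. The sketch below shows one simple way this might be done, using the fraction of the total signal that comes from the first allele. The function name, the thresholds and the minimum signal level are all hypothetical choices for illustration. They are not the calling rules of the TaqMan software or of any assay described in this thesis.

```python
def call_genotype(signal_a: float, signal_b: float,
                  min_total: float = 100.0,
                  hom_cutoff: float = 0.8,
                  het_low: float = 0.35,
                  het_high: float = 0.65) -> str:
    """Call a genotype from two reporter signals (illustrative thresholds only).

    signal_a, signal_b: background-corrected, non-negative fluorescence from
    the probes specific for allele A and allele B.
    """
    total = signal_a + signal_b
    if total < min_total:
        return "no call"              # too little signal to trust
    fraction_a = signal_a / total     # 1.0 = only allele A, 0.0 = only allele B
    if fraction_a >= hom_cutoff:
        return "AA"
    if fraction_a <= 1.0 - hom_cutoff:
        return "BB"
    if het_low <= fraction_a <= het_high:
        return "AB"
    return "no call"                  # ambiguous zone between clusters


if __name__ == "__main__":
    for a, b in [(900, 50), (40, 950), (500, 480), (700, 300)]:
        print(a, b, call_genotype(a, b))
```

Leaving an uncalled zone between the homozygous and heterozygous ranges trades call rate for accuracy. In practice the cut-offs would be set from the observed genotype clusters rather than fixed in advance.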
Allele-specific primer extension (ASE)

The strategy is based on the ability of a DNA polymerase to extend only allele-specific primers with a perfectly matched 3’-end (Gibbs et al. 1989; Wu et al. 1989), either directly on the target sequence or on already amplified samples. The discrimination between matched and mismatched primers by the polymerase is however not perfect, which leads to low specificity. One approach to increase the selectivity of the reaction is to include the enzyme apyrase. It degrades the nucleotides and thereby competes with the primer extension activity of the DNA polymerase (Ahmadian et al. 2001). A beneficial feature of the principle is a cost reduction due to the use of unlabelled probes, although it still involves some visualization of the genotyping results. Allele-specific extension has been applied for interrogating low numbers of variants, as in Illumina’s custom genotyping system using cylindrical glass microbeads (Veracode beads), and in specialized assay formats (Mitani et al. 2007).

Single base primer extension (SBE)

The incorporation of a single nucleotide in an allele-specific manner is called single base primer extension (SBE) or minisequencing (Syvänen et al. 1990). One detection primer that anneals directly adjacent to the SNP position in the target molecule is used per reaction. The detection primer is extended with a nucleotide by a DNA polymerase, guided by the complementary template strand. The accuracy of the enzyme governs the high specificity of the reaction. The extension reaction is independent of the nature of the variable nucleotide and the sequence context surrounding it, which allows most SNPs to be analysed under the same reaction conditions. Furthermore, the high specificity makes the reaction principle suitable for multiplexing.
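The logic of the single base extension step can be sketched in a few lines. The example finds where a detection primer anneals on a template strand and reports the nucleotide that would be added to its 3’-end. It is a simplified model that assumes exact, unique base pairing, and the sequences and function names are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def minisequencing_base(template: str, primer: str) -> str:
    """Nucleotide added to the primer's 3'-end in a single base extension.

    template: template strand, written 5'->3'.
    primer:   detection primer, written 5'->3', complementary to the template.
    """
    site = reverse_complement(primer)      # template region the primer pairs with
    start = template.find(site)
    if start == -1:
        raise ValueError("primer does not anneal to the template")
    if start == 0:
        raise ValueError("no template base available next to the primer 3'-end")
    # The primer's 3'-end pairs with template[start]; the polymerase then
    # copies the neighbouring template base on the 5' side, template[start - 1].
    return template[start - 1].translate(COMPLEMENT)


if __name__ == "__main__":
    primer = "TGACCATG"                    # hypothetical detection primer
    for allele in "GA":
        template = "TTGAC" + allele + "CATGGTCA"
        print(allele, "->", minisequencing_base(template, primer))
```

The same primer reports both alleles, since only the incorporated nucleotide differs. This is consistent with the text's point that the reaction is largely independent of the variable nucleotide itself.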
In an early comparison, a microarray-based minisequencing assay showed approximately one order of magnitude better genotype discrimination than allele-specific hybridization in the same assay format (Pastinen et al. 1997). The robust underlying principle of minisequencing is the basis for its successful integration into numerous assays for the analysis of nucleic acids at different levels of complexity, in various reaction formats and with different labelling strategies. It is the basis of the SNPStream system from Beckman Coulter (Bell et al. 2002) and the Infinium II platform provided by Illumina (Steemers and Gunderson 2007). Furthermore, the minisequencing principle has shown excellent performance in quantitative applications, especially with tritium-labelled nucleotides or when using mass spectrometric detection (Syvänen et al. 1993; Buetow et al. 2001; Milani et al. 2007). Minisequencing is the underlying reaction principle in the tag-array based genotyping system that was used in Studies I-III of this thesis. This system is discussed in more detail in the present study section.

Figure 4. Hybridization-based and enzymatic principles for allele discrimination. The reporter and quencher of the dual-labelled probes are denoted R and Q, respectively.

Oligonucleotide ligation (OLA)

As for the polymerase-mediated extension reaction, the joining of two perfectly matched DNA strands by DNA ligase is a highly specific event. DNA ligase has hence been used to discriminate between two allele-specific probes that are hybridized at a SNP site in a PCR product. When there is a perfect match, the ligase seals the nick between one of the two allele-specific probes and a third, locus-specific probe binding upstream of the SNP site. This principle, denoted the oligonucleotide ligation assay (OLA), was first described in 1988 (Alves and Carr 1988; Landegren et al.
1988) and is applied in the SNPlex™ system from Life Technologies (Applied Biosystems, http://www.lifetechnologies.com) for 48-plex SNP genotyping. The concept was adapted in the development of the padlock probe (Nilsson et al. 1994). The padlock probe is a linear oligonucleotide that contains the sequences of both OLA probes, one at either end, joined together by a linker sequence. The highly specific reaction requires two adjacent recognition events followed by an enzymatic ligation to form a circular molecule. Before signal read-out, the circle is amplified by rolling circle amplification or by universal PCR using sequences in the linker. Padlock probes have recently been used for single molecule detection in situ (Larsson et al. 2010). Single base extension was combined with ligation of molecular inversion probes for high-throughput SNP genotyping of 12,000 markers (Hardenbol et al. 2003; Hardenbol et al. 2005), commercially offered by ParAllele Biosciences, now part of Affymetrix. The polymerase fills the “gap” that is created at the SNP position before sealing by ligation. This approach requires only one probe per queried SNP, but separate extension reactions with one type of nucleotide present in each. Un-circularized probes are digested by exonuclease, followed by amplification of the circles and signal read-out on arrays. In the GoldenGate assay, developed by Illumina (Fan et al. 2003), enzymatic discrimination is combined with hybridization to give a highly specific genotyping assay for up to 1,536 SNPs. Hybridization of allele-specific primers directly to genomic DNA is combined with an extension reaction and a subsequent ligation to a locus-specific probe. All oligonucleotides contain sequences for universal PCR amplification of the generated template with labelled primers, and the locus-specific probe carries a barcode for capturing the ligated probe complex on bead-arrays, Veracode beads or bead-chips.
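The geometry of a padlock probe, with two target-complementary arms joined by a linker, can be made concrete with a short sketch. The code assembles a probe from the target sequence on either side of the intended ligation junction and then checks that the two probe ends meet head to tail on the target. The sequences, the linker and the function names are hypothetical, and the model ignores probe design constraints such as melting temperature and secondary structure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def design_padlock(upstream: str, downstream: str, linker: str) -> str:
    """Assemble a padlock probe (5'->3') for a target written 5'->3'.

    upstream/downstream: target sequence immediately 5' and 3' of the junction.
    The probe's 5' arm pairs with `upstream` and its 3' arm with `downstream`,
    so both probe ends sit side by side at the junction when hybridized.
    """
    five_prime_arm = reverse_complement(upstream)
    three_prime_arm = reverse_complement(downstream)
    return five_prime_arm + linker + three_prime_arm


def ligation_site(probe: str, arm5_len: int, arm3_len: int) -> str:
    """Target sequence (5'->3') covered by the arms after circularization."""
    joined_ends = probe[-arm3_len:] + probe[:arm5_len]   # 3' arm ligated to 5' arm
    return reverse_complement(joined_ends)


if __name__ == "__main__":
    up, down, linker = "AACCGGTA", "GATTACAG", "TTTTCCCC"   # invented sequences
    probe = design_padlock(up, down, linker)
    print(probe)
    print(ligation_site(probe, len(up), len(down)) == up + down)
```

In this layout a SNP placed at the first base of `downstream` would pair with the probe's 3’-terminal base. That position is an assumption made for the illustration, not a requirement stated in the text.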
Invasive cleavage

In the invasive cleavage method, the Invader assay, a structure-specific flap endonuclease (FEN) is used to cleave a complex formed by two allele-specific overlapping oligonucleotides that are hybridized to a target DNA containing a polymorphism (Lyamichev et al. 1999). The three-dimensional structure triggers the cleavage of the oligonucleotide, and the difference in cleavage rate between complementary and non-complementary probes is the basis for allele discrimination. The Invader assay principle has been developed for a variety of assay formats and detection strategies (Olivier 2005). Its advantages include high accuracy, an isothermal reaction that does not require thermal cycling, and the possibility to perform genotyping directly on genomic DNA. Despite the improvements, the method uses fairly large amounts of input DNA and has only been successfully multiplexed for the analysis of ~100 SNPs (Ohnishi et al. 2001).

Sequencing

DNA sequencing is nowadays the most commonly used method to identify specific nucleic acid sequences. It has played a great role in the identification of novel sequence variants but is also used in small-scale genotyping studies and for validation. Sequencing technology was pioneered in the mid-seventies by Maxam and Gilbert through their chemical sequencing method (Maxam and Gilbert 1977). At the same time, the less cumbersome chain-terminating method, based on incorporation of labelled terminating dideoxynucleotide triphosphates (ddNTPs) and electrophoretic separation of DNA fragments of varying length, was published by Sanger et al. (Sanger et al. 1977). The method uses transfected cloning vectors or PCR products as template for sequencing. In the early versions the sequencing primers were radioactively labelled and only one out of four terminating nucleotides was present in each reaction in addition to unlabelled deoxynucleotide triphosphates (dNTPs). The development of this method has been remarkable over the years.
Today Sanger sequencing is performed in a single reaction with all four fluorescently labelled nucleotides present (Smith et al. 1986; Prober et al. 1987). Other improvements include the development of whole-genome shotgun sequencing and separation of the sequencing products on automated high-throughput capillary DNA sequencers (Drossman et al. 1990). The sequence of Drosophila melanogaster was published in 2000 as the first application of the whole-genome shotgun approach to sequencing of an animal genome (Adams et al. 2000). With sequencers such as the ABI3730xl DNA Analyzer from Life Technologies, up to 1000 bases can be read in 96 or 384 samples per run. The technique has been crucial in the large genome projects but has also played a major role as a standard method for many genetic research approaches. Pyrosequencing is an alternative principle for sequencing (Ronaghi et al. 1996; Ronaghi et al. 1998). Different systems applying the principle are today commercially available through the company Qiagen (http://www.pyrosequencing.com). Its underlying reaction principle is sequencing by synthesis (SBS) (Melamede 1985), in which SBE has been further developed to determine the sequential nucleotide incorporation during strand synthesis by the DNA polymerase. In pyrosequencing the nucleotides are resolved through an enzymatic cascade resulting in a detectable signal. Nucleotide incorporation releases pyrophosphate (PPi), which together with adenosine phosphosulphate (APS) acts as substrate for ATP generation by ATP sulfurylase. The energy in ATP is converted to detectable light by luciferase, and the signal is directly proportional to the number of nucleotides incorporated in a nucleotide flow. Each nucleotide cycle contains one of the four dNTPs, and between dispensations the unincorporated nucleotides and the remaining ATP are degraded by apyrase.
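Because the light signal is proportional to the number of nucleotides incorporated in each flow, an ideal pyrosequencing read can be pictured as a list of integers, one per dispensation. The sketch below computes such an idealized flow signal for a strand being synthesized. The dispensation order, the function name and the example sequence are assumptions made for illustration, and the model is noise-free with complete incorporation in every flow.

```python
from typing import List


def ideal_flow_signals(synthesized: str, flow_order: str = "TACG",
                       n_flows: int = 8) -> List[int]:
    """Idealized pyrosequencing signal per nucleotide flow.

    synthesized: the strand being built by the polymerase, written 5'->3'
                 (the complement of the template).
    flow_order:  cyclic order in which the dNTPs are dispensed (assumed here).
    Returns the number of nucleotides incorporated in each flow.
    """
    signals = []
    position = 0
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A homopolymer stretch is extended completely within a single flow.
        while position < len(synthesized) and synthesized[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append(incorporated)
    return signals


if __name__ == "__main__":
    print(ideal_flow_signals("TTAGGGC"))
```

In this ideal picture a run of three identical bases gives exactly three times the single-base signal. In real data the proportionality degrades for longer runs, which is the homopolymer quantification problem discussed in the text.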
The degradation is crucial to avoid washing procedures between cycles and to decrease the accumulation of light signals that leads to high background levels at longer read lengths. In an effort to increase the read lengths, up to 100 bases could be determined (Gharizadeh et al. 2002). Reliable quantification of the signal is a problem when reading homopolymeric sequences. The use of multiple enzymes makes the assay rather expensive. Pyrosequencing has been used in small-scale genotyping projects and is frequently used for clinical applications.

Massively parallel sequencing

Despite the tremendous improvements, Sanger sequencing has reached its limits with respect to cost and parallelization (Shendure et al. 2005), and is too expensive for whole genome sequencing of large genomes. To stimulate innovation in the field, a sequencing race was initiated by the J. Craig Venter Science Foundation in 2003, to reward the first complex genome (human or of corresponding complexity) sequenced in full for less than 1000 USD. In 2004 this effort was further encouraged through a large grant program for sequencing technology development by the NIH. The rationale was that 1000 USD would be a reasonable cost to make multiplex large-scale sequencing a standard method. Among numerous promising applications, it would enable comprehensive analysis of the mutational spectrum of an organism. The efforts have indeed led to the development of several emerging sequencing technologies and innovative solutions with the ability to process millions of sequence reads in parallel: so-called next or second generation sequencing technology (Mardis 2008; Shendure and Ji 2008; Metzker 2010). Ensemble-based methods involve multiple clonally amplified molecules. The SBS strategy described above, using DNA polymerases and ligases, has been implemented for parallel analysis of DNA strands (Fuller et al. 2009). Nucleotides or short oligonucleotides are used for determining the base type.
The established methods all employ synchronized sequencing cycles in an iterative fashion, either by adding a single kind of nucleotide at a time or by adding nucleotide substrates that are reversibly blocked. Although the need for vector-based cloning strategies is eliminated, all of the methods require an initial sequencing library preparation step. This involves fragmentation of the molecules and incorporation of general priming motifs, i.e. adapter sequences, at both ends of a fragment for amplification and sequencing. The platforms can determine either one end or the paired ends of a given molecule. They also allow the inclusion of barcodes for sample indexing and pooled sequencing approaches. 454 sequencing, today commercially provided by Roche (http://www.roche.com), was the first MPS technique described, published in 2005 (Margulies et al. 2005). It is based on emulsion PCR, where library molecules are clonally amplified en masse on agarose beads, and on the pyrosequencing chemistry for nucleotide discrimination, conducted in a picotiter plate format (Leamon et al. 2003). Currently the Genome Sequencer FLX system generates more than one million high-quality reads per 10-hour instrument run, with a read length of about 500 bases. One drawback of the technique is the problem of reading homopolymers, which is an effect of the pyrosequencing reaction. The sequential flow of nucleotides however diminishes the occurrence of substitution errors in the sequence. The long reads simplify the alignment and make the system ideally suited for de novo sequencing of whole genomes and transcriptomes, including partially non-unique sequences. The second MPS system was developed by the company Solexa and introduced on the market in 2006 (Bennett 2004; Bennett et al. 2005) (Figure 5). Today the Solexa technology is commercially available in the Genome Analyzer (GA) platform from Illumina.
Library molecules are bound to oligonucleotide arrays in a flow cell and amplified by solid-phase bridge amplification to generate millions of clonal template clusters. Each round of SBS involves the addition of four differently labelled and reversibly terminating nucleotides and four-colour imaging of the incorporated base. The 3’-terminus of the incorporated base is de-blocked and the fluorophore is removed before the next sequencing cycle is initiated. Each flow cell can hold eight separate reactions or samples. The current throughput of the GAIIx system is 15-20 million paired-end reads per sample of ~2x100 bp length, adding up to a total of 20 Gbp generated in each ten-day machine run. In 2010, the company announced the launch of an updated version of the system (HiSeq2000) that will generate approximately ten times more data in each run, mostly due to better usage of the flow cells, which increases the surface area. The advantage of the Solexa technology is the large amount of data generated at a fairly low price, while the relatively short read length is still an obstacle for some applications and made the system initially suitable mainly for re-sequencing applications. With the constant increase in read length it has also been applied for whole human genome sequencing (Bentley et al. 2008), and it is the technology used by the 1000 Genomes Project. The multiplex polony sequencing by ligation method (Shendure et al. 2005) uses mate-pair library molecules that are constructed by circularization of template molecules, RCA, restriction cleavage and adapter ligation. The linear library molecules contain two genomic target sequence tags of 17-18 bp, separated and flanked by universal sequences. The fragments are captured on beads and amplified in an emulsion PCR step, which is followed by immobilization in an acrylamide gel to form a monolayered array of beads.
The sequencing strategy utilizes an anchor primer that hybridizes directly 5’ or 3’ of one of the genomic sequence tags. Differently labelled degenerate nonamers are added to the reaction, followed by a selective ligation between the anchor primer and the complementary nonamer. The base identity at the query site in the nonamer correlates with its label. After signal read-out the primer-probe complexes are stripped away and the next cycle begins with the addition of a new population of nonamers with a different query position. This permits sequencing with high specificity of 13 bp (6+7 from each end) in each genomic tag and a total of 26 determined nucleotides per amplicon. Proof of principle was shown by sequencing the genome of an Escherichia coli strain. The method has been further developed in the Polonator G.007 (http://www.polonator.org), an open platform provided by Dover Systems with open-source software and protocols and off-the-shelf reagents at a very competitive price.

Figure 5. Schematic drawing of the Genome Analyzer system using Solexa technology for MPS. (A) Sequence library molecules are, via their universal adapter sequences, randomly hybridized to tags in one of the eight lanes of a microfluidic flow cell. This is followed by bridge amplification on the surface to form millions of clonal molecule clusters (B). (C) The sequence of the molecules is deduced by SBS with four differently labelled, reversibly terminating nucleotides present. Each incorporation of a complementary base is followed by washing and (D) four-colour imaging.

The sequencing by oligonucleotide ligation principle has also been applied in the SOLiD™ system from Life Technologies. Library molecules are captured by oligonucleotides on magnetic beads, amplified by emulsion PCR and attached to glass slides. The sequencing process starts with the annealing of an adapter-specific sequencing primer and a set of four fluorescently labelled, semi-degenerate di-base probes.
A probe match results in a specific ligation event and a fluorescent read-out where the label corresponds to the first two positions in the probe. The last two bases of each probe, including the label, are removed to enable the next sequencing cycle. Multiple cycles of ligation, detection and cleavage are performed before the extension product is removed and the template is interrogated in a second round of ligation cycles with a primer complementary to the n-1 position. By this reset approach each base is interrogated twice, which enables a read accuracy check. Two drawbacks of the technology are the short read length and the somewhat complex data output format in colour space. The current version of the instrument (SOLiD™ 4) can generate up to 100 Gb of mappable sequence, or about 1.4 billion reads of 50 bp length, per ~11-day run. With the SOLiD™ 4hq System upgrade this scales to 300 Gb of sequence in 2.4 billion reads of up to 75 bp in length per run. The three major players described above have all introduced smaller systems during 2010 to better suit the needs of individual laboratories or clinical applications. These are convenient for targeted re-sequencing, pathogen detection and de novo sequencing of small genomes. General features include high speed, easy workflows and simplified data analysis. In addition to the systems mentioned above, the company Complete Genomics (http://www.completegenomics.com) offers whole-human-genome sequencing as a service, with a technique that resembles the principles employed in polony sequencing (Drmanac et al. 2010). Many of the problems related to ensemble-based sequencing methods can be overcome with single molecule sequencing, also referred to as third generation sequencing.
These include the difficulty of generating long reads because of phasing problems of signals from populations of molecules, the need for template amplification, and the fairly large consumption of sequencing reagents and of template material for library preparation. The company Helicos Biosciences (http://www.helicosbio.com) developed an amplification-free method based on sequential SBS of single molecules attached to glass slides and signal read-out using fluorescence microscopy (Braslavsky et al. 2003). The average read length today is 35 bases. Pacific Biosciences and Life Technologies both have real-time techniques under development that use a “free running” DNA polymerase for single molecule SBS. Pacific Biosciences (http://www.pacificbiosciences.com) is currently launching its SMRT™ method, in which DNA polymerase molecules are attached to zero-mode waveguides (Levene et al. 2003). After the addition of single molecule templates, the incorporation of labelled nucleotides by each individual DNA polymerase molecule is detected. The waveguide enables specific detection of incorporated labels among a background of thousands of labelled nucleotides. The DNA polymerase incorporates 1-3 bases per second in the initial commercial release and generates a long DNA chain in minutes. The single molecule sequencing system from Life Technologies employs a similar but reversed approach. Here the single molecule templates are attached to a solid support, to which individual DNA polymerase molecules are captured for sequence determination. This enables addition of new enzymes when the polymerase activity decreases. The incorporation of nucleotides is detected using quantum dots and FRET chemistry. Ion Torrent (http://www.iontorrent.com) uses semiconductor technology for determining the identity of natural nucleotides that are sequentially incorporated by a DNA polymerase. When a base is added to a DNA strand, a hydrogen ion is released.
The released ion changes the pH of the solution, which can be detected by a sensor. Currently this technique is not at the single molecule sensitivity level. Exonuclease-mediated cleavage of labelled nucleotides can also be utilized for sequencing (Werner et al. 2003). The same principle has been combined with nanopores that can recognize single unmodified deoxyribonucleoside monophosphates (dNMPs) (Astier et al. 2006). The pores act as base sensors when a dNMP interacts with the binding site and an adapter in the pore, resulting in a measurable change in ion current.

Assay formats

The most common format for conducting biochemical reactions has been in solution in test tubes. Homogeneous assays, where all reactions take place in the same test tube, require less laboratory handling but are limited in the multiplexing level of the reactions and are less sensitive due to the lack of separation possibilities. This format is applied, among others, in TaqMan assays (Holland et al. 1991). Different solid supports have however gained importance due to the possibility of parallelisation of samples and reactions. Solid-phase assay formats enable immobilization of targets by capture to oligonucleotides or to chemically activated surfaces, which is convenient for multi-step reactions that require multiple separation steps. The microarray format, first presented in 1992 (Southern et al. 1992), consists of a miniaturized solid support commonly made of glass. It facilitates parallel and miniaturized analysis with reduced costs and usage of DNA. Since the arrayed format was introduced, a number of supports and strategies have been developed for target organisation. These supports include microtiter plates, micro-beads in solution or assembled in microwells of fibre-optic bundles (Michael et al. 1998) or on silica slides (Gunderson et al. 2004), gel films (Dufva et al. 2006), and various types of micro-structures (Summerer et al. 2010).
Apart from wide usage in RNA expression analysis, microarrays are commonly used in DNA analysis and are gaining importance for analysis of proteins, antibodies and tissue. Microarrays have been essential in the development of large-scale SNP genotyping systems. Oligonucleotide arrays can be produced by in situ synthesis of the molecules on the surface, and Affymetrix uses a photolithographic process for its on-chip synthesis. More commonly, oligonucleotides are immobilized to activated surfaces by different printing techniques. The early microarrays for genotyping contained immobilized SNP-specific probes (Shumaker et al. 1996; Pastinen et al. 1997; Pastinen et al. 2000). “Tags”, in contrast, are barcode sequences complementary to sequence motifs included in the detection primers of enzyme-assisted genotyping assays. When combined with genotyping reactions in solution, tags allow flexible assay design. In addition, their generic nature enables the development of universal capture arrays, which simplifies manufacturing. Tag-arrays were initially applied to SNP genotyping in a ligase assay (Gerry et al. 1999) and were later combined with SNP genotyping by minisequencing (Fan et al. 2000; Hirschhorn et al. 2000). This assay format has been utilized in Studies I-III of this thesis.

Miniaturization

Genetic analysis is becoming increasingly important in clinical genetics and pharmacogenetics. Genetic variants that are found to be risk factors for inherited diseases will be analysed to predict the susceptibility and onset of disease, or the treatment response and clinical outcome. In addition, the identification and scoring of somatic changes that initiate cancers and the identification of the allelic spectra of human pathogenic microbes will also be important. There is great hope that the enhanced understanding of the effect of genetic variation may be implemented for diagnostic purposes to enable better, more individual treatment of each patient.
Large-scale genotyping techniques require massive equipment and time-consuming, labour-intensive procedures and analysis. These methods provide high-throughput analysis with a low price per genotype. Progress in the field of clinical genetics is, however, dependent on new assay formats and robust yet flexible technology that is fast, sensitive and inexpensive per sample when smaller numbers of clinical samples are to be analyzed (Ohno et al. 2008; Weigl et al. 2008). Miniaturization has been an efficient means of achieving enhanced and portable information technology. In the life sciences as well, micro-total analysis systems (µ-TAS), or lab-on-a-chip systems, are capable of integrating miniaturized components and the assay functionalities of multiple reaction steps, at the pL-nL volume scale, into a single microfluidic device. The advantageous features of these systems are automation, low risk of contamination, low reagent and sample consumption, decreased assay time, increased sensitivity, and a potentially high level of parallelization (Liu and Mathies 2009). Hence, these technologies offer promising tools and analysis solutions for efficient point-of-care DNA analysis. µ-TAS systems commonly consist of a microfluidic network for biochemical reactions in a device often made of glass or polymers. In addition, they require a number of control components such as heaters and temperature sensors for thermocycling, microvalves for separating functional units, actuators and micropumps for sample transportation and flow control, and magnets for sample manipulation using beads. The adoption of lab-on-a-chip assays in routine DNA-based diagnostics has been slow. This is mainly due to the difficulty of integrating the multiple functional steps required in genotyping assays, in particular sample treatment and signal detection (Whitesides 2006). The simple assay formats and short reaction times of single-reaction assays make them attractive.
Consequently, the commercially available biochips use electrophoretic separation and/or homogeneous amplification methods for analysing single markers or low numbers of markers. Two such examples are the chips provided by Agilent Technologies for high-resolution capillary electrophoresis in the Bioanalyzer, and Fluidigm’s (http://www.fluidigm.com) digital real-time PCR assays for quantification of DNA sequencing libraries for subsequent MPS (White et al. 2009). Integrated multiplex genotyping assays often rely on separation of several homogeneous reactions in different channels (Summerer et al. 2010). Genetic tests for complex diseases will require multiplex genotyping of multiple SNPs in multiple genes in each DNA sample. Ultimately, targeted clinical analysis of genes/mutations will be replaced by large-scale sequencing of disease gene pathways and networks. Thus, there is a need for completely automated microfluidic devices with several reaction steps and a solid-phase support for separation of reagents and analytes during the process. The true challenge is to increase assay complexity while reducing hardware and system complexity as well as the cost per test. In Study III we developed a multiplexed genotyping assay for cancer diagnostics in a biochip, using freeze-dried reagents for simplified reaction integration.

Coding and decoding

Previously, radioactivity and/or gel electrophoresis was the primary detection method of nucleic acid assays (Syvänen et al. 1993). Today, commonly used features for labelling are fluorescent molecules (Pastinen et al. 1996), specific mass of molecules (Karas and Hillenkamp 1988; Kim et al. 2004), the use of chemiluminescent enzymes (Nyren et al. 1993) and various sequence tags. The assay format influences the choice of allele label. Since homogeneous assays do not include any separation steps, the label must indicate when an allele-specific reaction has occurred.
These molecular changes can advantageously be detected using interacting fluorophores and observed by, for example, fluorescence resonance energy transfer (FRET). To be able to score different alleles of multiple products, unique coding of each molecule is necessary. The integration of a variant label and a locus label can be achieved during the process of the experiment and during assay design. Parallel read-out of individual signals depends on physical separation of products, for example by capillary electrophoresis or on microarrays. Detection methods include mass spectrometry and microscopes or high-resolution scanners with charge-coupled device (CCD) cameras or photomultiplier tubes (PMT).

Data analysis and quality control

Large-scale genetic studies such as genotyping, sequencing, and gene expression analysis generate large amounts of data. In particular, the output from MPS machines is increasing dramatically. This is accompanied by challenges for massive-scale data storage and information technology to ensure proper data management and interpretation of huge data quantities (Pop and Salzberg 2008; Horner et al. 2010). Computational analyses of complex data sets require (i) management of data, including image analysis, signal processing and background subtraction, in an automated manner with minimal levels of manual input and visual inspection, (ii) good tools and analysis strategies for base calling, read alignment and variant identification or allele scoring, and (iii) automated and objective quality control. Further, downstream and application-specific interpretation of data is needed for finding biologically relevant answers. In Studies I and II of this thesis, tools for automated allele calling and data quality control were developed.

Aim of this thesis

The overall aim of this thesis was to develop and apply strategies based on single base primer extension (SBE) to study sequence variation in the fruit fly and in humans.
Specifically, the aims were:
• To improve the performance of the tag-array minisequencing system for SNP genotyping and implement it in novel applications.
• To use the Genome Analyzer for massively parallel sequencing to evaluate two hybridization-based methods for targeted sequence capture.

The present study

As outlined in the introductory part, the analysis of sequence variation requires database resources to support assay design, approaches for reducing genomic complexity to achieve sensitive and specific assays, robust principles for allele discrimination, multiplex assay formats, and tools for simple data analysis and assay quality control. The studies included in this thesis address some of the issues listed above for the discovery and analysis of single nucleotide polymorphisms (SNPs) and contain technical improvements and new applications of methods based on four-color single base primer extension (SBE) chemistry.

Tag-array minisequencing

The tag-array minisequencing system for SNP genotyping was used in Studies I-III of this thesis. It is based on polymerase-assisted minisequencing/SBE for specific and robust allele discrimination and a tag-microarray format that allows medium to highly multiplex and parallel analysis (Lindroos et al. 2002) (Figure 6). A major advantage of tag-array minisequencing is its four-color detection system, which enables the interrogation of the four DNA bases, and hence any SNP variation, in a single reaction. This feature distinguishes tag-array minisequencing from commercially available SBE methods. Thus, the system is flexible and can be designed for any panel of custom-selected SNPs. It is based on standard, freely available equipment and reagents, making it easy to implement in any regular laboratory. Compared with the large commercial genotyping platforms, the tag-array minisequencing system is extremely cost efficient per DNA sample, enabling genotyping of 200 SNPs at a cost of about 1.4 € per sample.
The genotyping system has many convenient characteristics that can be utilized in specialized applications and in novel formats. It is particularly useful for establishing genotyping panels for organisms other than humans, where common predefined assays are not readily available.

Reaction steps and assay set-up

The tag-array minisequencing system consists of four different biochemical reaction steps, signal read-out, and data analysis. In addition, it involves a number of preparative steps. Figure 7 outlines the method, and the section below gives a brief overview of the system. Detailed information can be found in Study IIb and in Lindroos et al. (2002). Genotyping requires an initial amplification of the target sequences containing each SNP by multiplex PCR. As for many other genotyping technologies, the optimization of the PCR reactions is the bottleneck of the system, and the PCR accounts for about half of the cost per genotype. By careful optimization, and also by using a “brute force” approach, we were able to establish PCR reactions amplifying up to 43 genomic fragments in parallel in Study IIa. Primer oligonucleotides, preferably amplifying short fragments, were designed with the software Primer3 or Autoprimer (http://www.autoprimer.com). The remaining dNTPs and primers from the PCR reaction mixture are inactivated by treatment with Exonuclease I (ExoI) and shrimp alkaline phosphatase (SAP) prior to genotyping. The exonuclease activity of ExoI degrades the single-stranded primers by cleaving off nucleotides in the 3'-5' direction, and SAP removes the 5'-phosphate groups of the nucleotides. This enzymatic clean-up eliminates the need for any washing step or physical separation for product purification.

Figure 6. (A) The tag-array minisequencing system, based on single base primer extension in solution with fluorescent ddNTPs.
(B) The primers contain 5'-tag sequences for capture of the minisequencing reaction products on generic microarrays carrying complementary c-tag sequences in an “array-of-arrays” format. The image illustrates one sub-array/sample showing the hybridization of extension primers for two different SNPs, one being homozygous and the other heterozygous in that particular sample.

Multiplex minisequencing reactions are performed in solution for high efficiency and are cycled to increase the number of extended primers. The minisequencing primers are designed to anneal immediately adjacent to and upstream of the SNP site and are extended with fluorescently labelled ddNTPs (Figure 6A). To limit misincorporation in the reaction, a specific DNA polymerase with a 3'-5' exonuclease proof-reading capability is used. Each minisequencing primer is designed to contain a 5' tag sequence for hybridization and identification on the microarrays. In Studies I-III, minisequencing primers were designed for both DNA polarities of each SNP using the SBE primer software (Kaderali et al. 2003), and the primer-tag combinations were evaluated with the software Autodimer (Vallone and Butler 2004) to avoid oligonucleotide sequences that form secondary structures. Tag sequences are 20 nt long, which gives 4^20, or about 10^12, possible unique probe combinations and enables highly multiplex analysis. In the studies included in this thesis, between 11 and 119 different clinical mutations, SNPs or InDels were genotyped in parallel in each single reaction.

Figure 7. Procedure for assay set-up and genotyping of SNPs by cyclic minisequencing with capture on Tag-arrays. Required preparative steps are specified on the left whereas the genotyping procedure is given on the right.

For multiplex signal detection, the SNPs are separated by hybridization to glass microarrays with immobilized capture oligonucleotides (c-tags).
The extended minisequencing primers are captured via their tag sequences to complementary c-tags with known locations on the arrayed surface. The use of generic tag sequences enables universal, non-SNP-specific array designs. In the studies included in this thesis we use an “array of arrays” format, in which the parallel capacity of the microarray has been developed further (Pastinen et al. 2000). It enables multiple SNPs to be interrogated in several samples, analysed in individual sub-arrays on the same microscope slide (Figure 6B). This format is accomplished by a silicone rubber grid creating separate reaction chambers. Either 16 or 80 samples can be analyzed simultaneously for up to ~580 or ~200 SNPs, respectively, generating as many as ~16,000 genotypes per slide. The “array of arrays” format is particularly well suited for comparing reactions or assay performance (Lovmar et al. 2003; Hultin et al. 2005), as in Studies I and III, where both versions of the assay format have been utilized. The c-tags to be attached to the microarray slides contain 15 T-residues as spacers and an NH2 group at their 3'-ends, by which they are covalently coupled to CodeLink™ Activated Glass Slides. The microarray slides were printed with spots of around 150 µm in diameter and a spot-to-spot distance of 200-210 µm. The microarrays are scanned at four wavelengths to measure the signal intensities of the incorporated fluorescently labelled ddNTPs, which allows deduction of the genotype of each SNP. The signals are quantified and genotypes are assigned based on scatter plots with the logarithm of the sum of both fluorescence signals, log(a+b), plotted against the signal ratio or fraction a/(a+b). The measured intensity of the incorporated nucleotides is visualized as three genotype clusters. We used the custom software SNPSnapper for genotype assignment. In Study II, automatic allele calling was also performed with the custom-developed SNPmapper software.
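As an illustration of the scatter-plot coordinates described above, the following minimal Python sketch converts the two allele signals of one SNP in one sample into the signal fraction a/(a+b) and the logarithm of the signal sum. It is not the SNPSnapper or SNPmapper implementation: the base-10 logarithm, the function names and the fixed thresholds in `naive_call` are assumptions made for this example only, whereas in the studies genotypes were assigned from the clusters observed in the scatter plots.

```python
import math


def scatter_coordinates(a, b):
    """Return (fraction, log_sum) for the two allele signals a and b.

    fraction = a / (a + b) is the x-coordinate of the scatter plot,
    log_sum = log10(a + b) is the y-coordinate (base 10 is assumed here).
    """
    total = a + b
    if total <= 0:
        raise ValueError("the sum of the two signals must be positive")
    return a / total, math.log10(total)


def naive_call(a, b, low=0.2, high=0.8, min_log_sum=2.0):
    """Hypothetical fixed-threshold genotype call, for illustration only."""
    fraction, log_sum = scatter_coordinates(a, b)
    if log_sum < min_log_sum:
        return "no call"  # total signal too weak to interpret
    if fraction >= high:
        return "AA"       # signal almost exclusively from allele a
    if fraction <= low:
        return "BB"       # signal almost exclusively from allele b
    return "AB"           # both alleles contribute


if __name__ == "__main__":
    for a, b in [(900, 100), (500, 500), (50, 950), (5, 5)]:
        print(a, b, scatter_coordinates(a, b), naive_call(a, b))
```

Fixed thresholds are used here only to keep the sketch short. Because signal ratios differ between SNP assays, cluster-based assignment per assay, as used in this thesis, is the more robust choice.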
Silhouette scores for quality assessment of genotype data

Multiplex genotyping assays require good and automated quality control. Quality control is performed both on the sample level and by including defined control oligonucleotides in the reactions. Moreover, the level of deviation between the observed and the expected distribution of the results and the segregation of alleles are commonly used quality measures. In the tag-array minisequencing system, oligonucleotide probes are included at each step of the reaction protocol for comprehensive quality control (Lovmar et al. 2003). In Study I of this thesis we addressed quality control of genotype data and developed a general tool for automatic evaluation of allele calling and quality assurance of SNP genotype clusters. The quality measure was applied to evaluate different DNA polymerases (Study I) and freeze-dried reaction mixtures (Study III) in the tag-array minisequencing system.

Silhouette scores

“Silhouettes” were established as a graphic aid for interpretation and validation of cluster data (Rousseeuw 1987). Silhouette scores provide a measure of how well a data point is classified when it is allocated to a cluster. This makes them well suited for assessing the quality of genotype results when genotype assignment is performed with data based on scatter plots. A Silhouette takes into account both the tightness of the clusters and the separation between them. A robust SNP genotyping assay is characterized by large distances between the three genotype clusters and small distances between the data points within the same cluster.

Figure 8. Scatter plot used for genotype assignment in the tag-array minisequencing system illustrating the principle for the Silhouette s(i) calculations for a given data point (i). The lines indicate the average distance to all points in the same cluster a(i) and to all points in the other two clusters b1(i) or b2(i).
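The principle shown in Figure 8 can be sketched in a few lines of Python. The sketch uses Rousseeuw's definition s(i) = (b(i) - a(i)) / max{a(i), b(i)}, with b(i) taken as the smaller of the average distances to the other clusters. It is not the ClusterA program: the function names, the Euclidean distance in two dimensions, the convention s(i) = 0 for a point that is alone in its cluster, and the averaging over all data points to obtain one score per assay are assumptions of this example. The default cut-offs in `triage` are the values of 0.65 and 0.25 recommended later in this section.

```python
import math
from collections import defaultdict


def silhouettes(points, labels):
    """Return the Silhouette s(i) of every data point.

    points: list of (x, y) scatter-plot coordinates.
    labels: genotype cluster assigned to each point (same order).
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        if not others:
            return None
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        a = mean_distance(i, clusters[label])           # own cluster
        bs = [mean_distance(i, members)
              for other, members in clusters.items() if other != label]
        if a is None or not bs:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        b = min(bs)                                     # closest other cluster
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Average s(i) over all data points of one SNP assay (assumed here)."""
    values = silhouettes(points, labels)
    return sum(values) / len(values)


def triage(score, accept=0.65, fail=0.25):
    """Classify an assay using the cut-offs recommended in Study I."""
    if score > accept:
        return "accept"
    if score < fail:
        return "fail"
    return "inspect"


if __name__ == "__main__":
    pts = [(0.05, 3.0), (0.06, 3.1), (0.50, 3.0),
           (0.52, 3.1), (0.95, 3.0), (0.94, 3.1)]
    calls = ["BB", "BB", "AB", "AB", "AA", "AA"]
    score = silhouette_score(pts, calls)
    print(round(score, 2), triage(score))
```

Distances measured in one dimension, for example along the x-axis only, are obtained by passing one-element tuples such as `(0.05,)` as points.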
The Silhouette calculation for one data point in a typical scatter plot obtained by a SNP genotyping assay is illustrated in Figure 8. For each given data point (i) in a scatter plot, the Silhouette s(i) is defined as:

\[ s(i) = \frac{\min\{b_1(i),\, b_2(i)\} - a(i)}{\max\{a(i),\, \min\{b_1(i),\, b_2(i)\}\}} \]

where a(i) is the average distance to all data points in the same genotype cluster and b(i) = min{b1(i), b2(i)} is the average distance to all data points in the closest cluster. Max and min in the formula denote the largest or smallest of the measures in the brackets. The average Silhouette values are calculated for each cluster (average Silhouette width) and for each SNP assay (Silhouette score) to generate a single quality measure for each SNP assay, ranging from 1.0 to -1.0. When the average distance from a data point to the other data points within the same cluster is much smaller than the average distance to all data points in the closest cluster, a Silhouette value close to 1.0 is obtained. A value close to zero indicates that the data point could equally well have been assigned to the neighbouring cluster. A negative Silhouette is obtained when the cluster assignment has been illogical and the data point is actually closer to the neighbouring cluster than to the other data points within its own cluster. We developed the program ClusterA to calculate numeric Silhouettes for clustered data obtained from fluorescence signal ratios. With the program, the distance between data points can be measured either in one dimension, for example along the x-axis, or in two dimensions using vectors, as shown in Figure 8. The program also provides the mean, variance and F-statistic for the input data.

Enzyme performance

Several DNA polymerases have been modified to incorporate fluorescent ddNTPs and dNTPs at equal efficiency in Sanger sequencing.
The ThermoSequenase DNA polymerase was engineered to incorporate ddNTPs with an approximately two-fold preference over dNTPs (Tabor and Richardson 1995), and recently additional thermostable enzymes have become available that are compatible with fluorescent ddNTPs. Since the DNA polymerase is the major factor determining the specificity and efficiency of the minisequencing reaction, a comparison of different DNA polymerases was of interest. In Study I we compared the three DNA polymerases TERMIPol, Therminator and KlenThermase to the ThermoSequenase enzyme, which was previously used routinely in our tag-array minisequencing system as well as in many other laboratories. ClusterA was applied to calculate Silhouette scores on genotype clusters for 26 SNPs acquired when comparing the performance of the four DNA polymerases with the tag-array minisequencing system. All reactions were performed under the same reaction conditions, with optimized protocols regularly used for genotyping with the ThermoSequenase enzyme. The “array of arrays” format (Pastinen et al. 2000) was useful for the comparison and enabled us to test all enzymes on the same microscope slide. We applied a non-stringent genotyping approach to be able to detect differences between the enzymes. The performance of the DNA polymerases was evaluated by comparing genotyping success, fluorescent signals, signal-to-noise ratios, and cluster quality (Table 1) for each enzyme and SNP. Figure 9 illustrates genotype clusters from three SNP assays that yielded different Silhouette scores in our study. Based on our evaluation we concluded that SNP assays with Silhouette scores >0.65 may be accepted automatically, while assays with Silhouette scores <0.25 should be failed. Assays with Silhouette scores falling between these values could be accepted or failed after manual inspection.
These recommendations are in line with Liu et al., who included Silhouette calculations in the algorithm used to interpret the data from the Affymetrix 10K HuSNP hybridization microarray (Liu et al. 2003). By using 0.65 as cut-off, 70–76% of the SNP assays would have been successful in this study. Because of the non-stringent genotype calling strategy, very low Silhouette scores were obtained for some SNP assays, which normally would have been failed.

Figure 9. Silhouette scores for genotype clusters from three SNP assays. The results are from 16 samples genotyped in duplicate. Squares and circles correspond to homozygotes for allele 1 and allele 2 respectively, and triangles are heterozygote samples. The SNPs are denoted by their dbSNP identification number and the DNA polarities analyzed. Silhouette scores are shown in the upper right-hand corner of each panel.

Table 1. Enzyme performance

Enzyme            Silhouette score   Silhouette score   Genotype calls (1)              S/N (2)
                  (average)          (highest)          Correct         Errors          (average)
                                     n      %           n       %       n      %
TERMIPol          0.72               20     25.3        2337    98.9    18     0.8      4.3
Therminator       0.69               15     19.0        2323    98.3    32     1.4      3.6
KlenThermase      0.74               22     27.8        2346    99.3    10     0.4      8.0
ThermoSequenase   0.71               22     27.8        2324    98.3    34     1.4      8.9

(1) Number of genotype calls (n) and call rate (%).
(2) S/N was calculated by dividing the fluorescence intensity values from nucleotides corresponding to a true genotype (signal) by the fluorescence intensity value from the remaining ddNTPs (noise). Both signals and noise were measured from the same spot.

All four enzymes performed satisfactorily in our minisequencing assays. KlenThermase showed the highest average Silhouette score and the tightest distribution of Silhouette scores, and generated more correct genotypes than the others (Table 1). In Figure 10 below, the average fluorescent signal intensities are displayed with respect to SNP genotype and nucleotide, illustrating the true and incorrect signals generated by each enzyme.
The signal intensity levels were comparable except for the T-signals, which were lower for two of the enzymes. In general, the levels of correct signals were high for KlenThermase, with little misincorporation. In conclusion, KlenThermase showed the overall best results in our study at a competitive price.

Figure 10. Average fluorescent signal intensities with respect to SNP genotype. Intensity levels are plotted on the y-axis for each ddNTP type and SNP genotype. Signals for homozygotes (AA, CC, GG and TT), heterozygotes (AX, CX, GX and TX) and from mis-incorporated ddNTPs of each nucleotide type (XX) are given for each DNA polymerase.

Tools for gene mapping in Drosophila melanogaster

For efficient gene mapping in the model organism Drosophila melanogaster, a map position needs to be refined from a whole chromosome or chromosome arm to about 50 kb, which is the practical limit for recombination mapping. The resolution of available SNP maps in Drosophila was until recently insufficient for fine-mapping, and the evolving sequencing technologies are still too costly for large screens. Therefore, in Study II we developed a dense SNP map and a set of tools to accelerate large-scale gene mapping. All laboratory steps and the protocols are thoroughly described in Study IIb.

SNP map and FLYSNPdb

A high-resolution SNP and InDel map was established and organized in the publicly available FLYSNPdb. A detailed description of the SNP map and database creation can be found in Publication IIa and in Chen et al. (2009). In short, it included amplification and sequencing of fragments in non-repetitive, non-coding regions equally distributed along each major chromosome arm of Drosophila (X, 2L, 2R, 3L, and 3R). Three to six divergent Drosophila strains for each chromosome were chosen according to common usage in genetic screens and suitability for mapping purposes. We identified a total of 109,369 SNPs (2,244 amplifiable SNP marker regions).
The average distance between SNP markers, and thus the resolution of the map, is 50.2 kb, which on average comprises 6.3 genes. Marker and strain information is found in the FLYSNPdb via the FlySNP homepage (http://flysnp.imp.ac.at).

Genome-wide genotyping assays for gene mapping

To establish a resource for fast gene mapping in Drosophila, we used the information from the high-density SNP map to design a panel of 295 genotyping assays for SNPs and InDels, equally distributed across the Drosophila genome, with an average spacing of 390.5 kb. Initially, a larger number of genotyping assays were designed, and the best-performing assays were included in a final set. Two fly samples with divergent genotypes were used for each chromosome. Strain and assay information for the final set is included in Table 2 below.

Table 2. Genotyping assays
Chr Fly strains X 2 3 CSisoX (Bloomington #6364) Bloomington #98 CSiso2/CyO ORiso2/CyO CSiso3 (Bloomington #6366) Bloomington #576 SNP assays 58 2L = 59 2R = 60 3L = 58 3R = 60 58 Multiplex PCR reactions 5 Fragments per reaction 13 5-14 11 3-16 5-23

Both the PCR reactions and the genotyping assays were designed in a highly multiplex fashion using a “brute-force” approach. For this purpose we fully utilized the flexibility and the two versions of the “array-of-arrays” format. Initially the genotyping assays were set up in the format that enables genotyping of a larger number of SNPs, and subsequently the format was redesigned for genotyping 80 recombinants with the final set of SNPs on each microarray slide. One microarray slide each for chromosomes X, 2, and 3 was established for discriminating between the selected stocks (Figure 11). If other strains are used for mapping, the established microarrays are still potentially useful, with a varying loss of informative SNP assays. Moreover, new minisequencing assays can easily be established from the SNP map if a different SNP set needs to be tested.
Figure 11. Scanning images for one sub-array from each of the three established microarrays. The visualized sub-arrays represent the samples CSisoX, CSiso2 and CSiso3 respectively. Each sub-array is scanned at four different wavelengths corresponding to the fluorescent label of each nucleotide, indicated in the upper left corner of each image row.

SNPmapper for simplified data analysis

Combining the microarray data with the phenotypic information is not trivial. We therefore designed a software tool called SNPmapper to automate the interpretation of the genotype data and the transformation into a genetic map position. SNPmapper is a platform-independent, freely available software for allele calling and for correlating and displaying genotypes and phenotypic information in a user-friendly and colour-coded graphical output. The recombinant chromosomes are sorted by phenotype and breakpoint, illustrating all recombination events. Genotype calling is achieved by applying a clustering algorithm to the fluorescence intensity values. Custom options for background corrections and exclusion of poorly performing samples are also available.

Resources for fine-mapping

The mapping resolution of the minisequencing method was 3 Mb on average. The identification of the gene of interest thus requires further fine-mapping. Fine-mapping can be achieved by (i) comparison to or complementation with known mutants, (ii) using chromosomal deficiencies or P-element insertion mutants, (iii) low-scale SNP genotyping, or (iv) RNAi on genes in the mapped region. Two different tools for fine-mapping were established in Study II. We generated fly stocks containing two closely linked EP elements at defined genomic intervals. The recombination events in the segment between the EP insertions are detectable by eye colour. Three to 16 EP lines are available for the different chromosome arms, which makes it possible to narrow down the mutated region to a 10-20 kb interval.
A set of 1400 additional genotyping assays was also pre-computed for local genotyping of the 2EP intervals.

Mapping of genes underlying muscle development

To verify the potential of the resources, we mapped mutants from a screen for genes required for Drosophila embryonic muscle morphogenesis. To screen for genes on chromosome 2, ORiso2 lines were mutated and muscle defects were observed. In total, 140 muscle mutant lines were recovered. About 100 recombinants per mutant were created for mapping with the minisequencing system by crossing the mutagenized stocks with CSiso2. In total, 1713 samples were genotyped in the screen and more than 160 000 genotypes were generated. Of these, 549 samples were typed in the densest assay panel at the time, comprising 115 SNPs on chromosome 2. The success rate for these assays was 90.56%, and the assay accuracy for 4-5 control samples was 99.67%. The average final resolution of the mapping method was 2.8 Mb, which is determined both by the density of successful genotyping assays and by the number of recombination events. The gene Sticks and Stones, known to play a role in muscle morphogenesis, was used as a positive control for the screen. The gene was mapped correctly to a region of 2.3 Mb, which demonstrated the robustness of the presented SNP mapping tools. By mapping 14 genes in less than four months, the practical feasibility of our system was shown.

Tag-array minisequencing in a lab-on-a-chip

In Study III we established genotyping assays with the tag-array minisequencing system to be implemented in a lab-on-a-chip assay. Assays for multiplex DNA analysis are important for point-of-care testing in the doctor’s office and for other types of on-site applications. As a relevant model for a diagnostic chip, the prototype panel was developed for five mutations in the gene encoding the tumour suppressor protein p53 (TP53), which are the most frequently mutated sites in common cancers (Petitjean et al. 2007; Petitjean et al. 2007).
Additionally, we included 13 common SNPs in the same gene for evaluation of the assay performance. Reaction protocols were established and optimized for each step of the genotyping procedure. The multiplex PCR reaction amplified 13 fragments containing the 18 markers, and genotyping was performed by an 18-plex minisequencing reaction. One obstacle to complete integration is liquid handling and external reagent supply. To enable integration into a microfluidic chip, we therefore sought to reduce the hardware complexity by storing all reagents needed in the genotyping protocol on the chip during fabrication. Since the tag-array minisequencing protocol is based on four consecutive reaction steps with no washing or separation of reagents in between, the storage of reagents on-chip abolishes the need for external reagent supplies.

Freeze-drying of reagents

One approach for simplifying on-chip storage is freeze-drying, or lyophilisation, of the reaction mixtures. Lyophilisation is a dehydration process which involves initial freezing of a substance followed by a reduction of the aqueous content at low pressure. Freeze-drying has been widely used by the food, pharma, and biotech industries to preserve and extend the shelf life of a product, and to facilitate transportation through reduced weight. The procedure has further been utilized to stabilize reagents for biochemical assays, mainly “master mixtures” consisting of buffers, nucleotides and primers. Thermostable DNA polymerases have, however, also been freeze-dried (Klatser et al. 1998) and applied for amplification-based pathogen detection (Aziah et al. 2007; Aitichou et al. 2008) or for genotyping in miniaturized assays (Focke et al. 2010). We freeze-dried reagents for the three enzymatic reaction steps of the minisequencing method (PCR, PCR clean-up and minisequencing) as well as the hybridization buffer.
These reactions include two different types of thermostable DNA polymerases, used in the PCR and minisequencing reactions, and the thermo-sensitive enzymes Exonuclease I (ExoI) and shrimp alkaline phosphatase (SAP), used for PCR purification.

During freeze-drying the molecules are exposed to stress, which may lead to reduced enzyme activity when rehydrated, mainly because of altered molecular structure. The addition of stabilizing substances, lyoprotectants, may enhance the enzyme stability and the recovery rate. Lyoprotectants might, however, also affect the overall reaction performance by themselves inhibiting the active molecule. We performed combinatorial real-time PCR experiments to investigate the activity of the DNA polymerase when freeze-dried both without and in the presence of a combination of the four lyoprotectants trehalose, polyethylene glycol (PEG), mannitol and dextran, at different concentrations. The enzyme activity was high for all concentrations of lyoprotectants tested, and no inhibitory or cooperative effects from lyoprotectants were observed. Although high performance was expected for the robust DNA polymerase, it was important to confirm the activity of the special SmartTaq enzyme, which is suitable for multiplex PCR. All reactions that included PEG resulted in a lower PCR yield, and consequently we removed it from all mixtures.

The stability of the active constituent over time is critical. In a long-term stability test of the DNA polymerase, we obtained good PCR products using dry mixtures stored for up to six months. This result was attained even when low concentrations of lyoprotectants were included in the mixture.

The novel aspect of our study is the freeze-drying of the thermo-sensitive enzymes ExoI and SAP, which are expected to be more fragile and sensitive to the freeze-drying procedure. We performed accelerated stability tests on these enzymes, which enabled a prediction of the shelf life in a shorter time span.
In our experiment, freeze-dried samples of ExoI and SAP were incubated at elevated temperatures for five days, and the residual enzymatic activities were measured with two fluorescence assays. These experiments revealed a clear dependence of stability on both storage temperature and lyoprotectant concentration. The experimental data were used to predict the half-lives of the enzymes to be 50 days for ExoI and 200 days for SAP when stored at -21ºC. Our results are the first to show the feasibility of freeze-drying ExoI and SAP. In the establishment of optimal reaction protocols for the integrated assay we also considered the effect of lyoprotectant accumulation in consecutive reaction chambers.

Systematic genotyping comparison

We next evaluated the freeze-dried reagent mixtures in the complete genotyping protocol. We performed a systematic genotyping comparison where reagents in each of the reaction steps were replaced with dry reagent mixtures in a stepwise manner. Twelve different reagent combinations were investigated using one cell-line DNA sample (Figure 12A), and analysed on the same microarray to minimize technical variation. Nearly identical results are illustrated by the scan images for combinations 1 and 12, using either liquid or freeze-dried reagents in each reaction step (Figure 12B). These results were confirmed by quantitative evaluation of the signal intensities and background noise (Figure 12C) as well as the signal-to-noise ratios (S/N) (Figure 12D). The experiment did not point to any particular freeze-dried reagent as critical for the success of the assay.

Figure 12. (A) Combinations of freeze-dried and liquid reaction mixtures tested. (B) Fluorescent scan images at four wavelengths for mixture combinations 1 and 12. Signal and background levels (C) and S/N ratios (D) for all SNPs in each of the 12 reagent combinations.
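The reasoning behind an accelerated stability test can be illustrated with a short calculation. The sketch below is not the procedure of Publication III; it assumes first-order loss of enzyme activity and an Arrhenius temperature dependence, and all function names and numbers in it are hypothetical. Residual activities measured after a fixed incubation at elevated temperatures are converted to decay rates, and a straight-line fit of ln k against 1/T is extrapolated to a lower storage temperature.

```python
import math


def decay_rate(residual_fraction, days):
    """First-order rate constant (per day) from the fraction of activity left after `days`.

    Assumes activity decays as exp(-k * t); this is a modelling assumption.
    """
    if not 0.0 < residual_fraction < 1.0:
        raise ValueError("residual_fraction must lie strictly between 0 and 1")
    return -math.log(residual_fraction) / days


def half_life(rate):
    """Half-life (days) for a first-order process with rate constant `rate`."""
    return math.log(2.0) / rate


def arrhenius_extrapolate(temps_c, rates, target_c):
    """Least-squares fit of ln k = intercept + slope / T, evaluated at target_c.

    temps_c and target_c are in degrees Celsius; T is absolute temperature.
    """
    xs = [1.0 / (t + 273.15) for t in temps_c]
    ys = [math.log(k) for k in rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept + slope / (target_c + 273.15))


if __name__ == "__main__":
    # Hypothetical residual activities after five days at three temperatures.
    temps_c = [37.0, 45.0, 55.0]
    residuals = [0.90, 0.80, 0.60]
    rates = [decay_rate(r, 5.0) for r in residuals]
    k_storage = arrhenius_extrapolate(temps_c, rates, 21.0)
    print(f"predicted half-life at 21 C: {half_life(k_storage):.0f} days")
```

With only a few temperatures the extrapolation is sensitive to measurement noise, so a prediction of this kind is best read as an order-of-magnitude estimate.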
We genotyped DNA samples from ten blood donors using either freeze-dried or liquid reagents in each reaction step, in two replicate experiments (LIQ1 and FD1, and LIQ2 and FD2, respectively). The median S/N values were slightly lower using freeze-dried reagents than with liquid reagents (Figure 13A). The concordant results for the replicate experiments indicated a high reproducibility of the genotyping process using freeze-dried reagents. All genotype classes were obtained, which enabled us to determine the genotypes for these samples and to evaluate the cluster quality. The cluster quality of successful assays was assessed by calculating numeric Silhouette scores, described in detail on page 43 of this thesis and in Publication I. A median Silhouette score of 0.72 was obtained in the experiments performed using liquid reagents, but acceptable scores (median = 0.50) were also achieved in the experiments with freeze-dried reagents (Figure 13B). On average, 175 of the 180 genotypes (97.2%) were successfully called in the two experiments employing liquid reagents, while the corresponding number in the experiments with freeze-dried reagents was 144 (80%). The genotype concordance was high in our experiments. Among the successfully called genotypes only three discrepant results were observed, all using freeze-dried reagents.

Figure 13. (A) S/N ratios and (B) Silhouette scores for all SNP assays genotyped in ten blood donors. Two replicate experiments were performed using either liquid (LIQ1 and LIQ2) or freeze-dried reagents (FD1 and FD2).

Genotyping in micro-fabricated structures

On the way towards full integration, we transferred the biochemical reactions of the multiplex genotyping protocol to micro-fabricated test chambers. All reaction mixtures were freeze-dried in dedicated reaction chambers. The 18 SNPs and mutation sites were genotyped in triplicate reactions using a cell-line DNA sample under the same conditions as for the procedure in test tubes.
The reaction products were manually transferred between chambers for each reaction, and the extended minisequencing primers were hybridized to a conventional glass microarray for signal read-out. Reaction chambers were milled in a cyclic olefin copolymer (COC) substrate (Figure 14A). A special chamber sealing solution was developed that allowed thermocycling of the chambers at high temperatures as well as easy product recovery. Thermocycling and incubation were achieved with a custom-made temperature actuator with a Peltier element for active heating and cooling, on which the reaction chamber was fixed. The design and fabrication of the hardware components are detailed in Publication III. The multiplex PCR product generated in the first test chamber was found to be comparable to a PCR product obtained using freeze-dried reagents in a test tube (Figure 14B). As shown in Figure 14C, the genotyping products generated in consecutive chambers provided a proof-of-principle for successful transfer of the multiplex genotyping reactions, using freeze-dried reagents, to the microchip format. This result also confirmed that the freeze-dried reagents are compatible with the polymer material, which is commonly used for the fabrication of microfluidic devices. Through the European SMART-BioMEMS project, the goal is to implement the deposition of freeze-dried reagents in a fully integrated lab-on-a-chip system for diagnostic genotyping, including sample pre-treatment and fluorescence detection (Figure 14D and 14E).

Figure 14. (A) Reaction chambers milled in a COC substrate to generate single-chamber chips. (B) Electropherogram showing multiplex PCR products generated with freeze-dried reagents in COC chambers (left) and test tubes (right). (C) Fluorescent scan images for triplicate genotype assays performed in COC chambers and scanned at two wavelengths for Tamra-ddCTP (top) and Cyanine5-ddUTP (bottom).
(D) Integrated COC chip containing reaction chambers for all biochemical reactions including sample treatment and array hybridization. (E) Reaction control system. The devices shown in panels D and E were developed in the SMART-BioMEMS project (project no: IST-016554).

Sample complexity reduction for re-sequencing candidate genes

The Genome Analyzer system from Illumina is based on the same SBE reaction principle as used in the minisequencing system. The four-colour chemistry has been developed further to allow massively parallel sequencing (MPS) of short DNA fragments. The system uses an engineered high-fidelity enzyme for enhanced read accuracy, and sophisticated reversibly terminating fluorescent nucleotides. In Study IV, we utilized the sequencing system to evaluate two methods for targeted MPS of candidate genes and variant discovery: Nimblegen capture arrays from Roche and the SureSelect capture system from Agilent. The systems are described in more detail on page 23 of this thesis and in Figure 3. The goal of the study was to identify the optimal enrichment method for continuous, custom-selected regions for sequencing using the Genome Analyzer IIx. Both systems are commercially available. The assay design is included in the service and only requires the coordinates of the desired target regions from the customer. Both methods are based on target selection by hybridization, conducted either in situ on a microarray or in solution. Our comparison experimentally evaluated the complexity reduction in relation to probe design, successful target capture and correct variant calling.

Study design

We aimed at targeting the sequences of 56 genes, including exons, introns and 5 kb of non-coding sequence flanking each gene, which made up a total target region of 3.1 Mb. The SureSelect capture probes were designed to tile the target region with a 30 bp offset, while the tiling was denser (5 bp offset) with the Nimblegen probes.
A difference in the design of the two probe sets is the way repetitive sequences are defined. The SureSelect method uses a repeat-masking algorithm to exclude defined sequence motifs with low complexity. The Nimblegen method, on the other hand, uses the occurrence of each possible 15-mer in the probe sequence as a measure of complexity, and discards everything above a certain threshold. A custom option for repeat-masking does exist for the SureSelect method. The difference in probe design was reflected in our experiments by the difference in size of the probe-covered region, but also by lower capture success for low-complexity regions.

We prepared standard sequencing libraries from cell-line DNA of three related HapMap individuals. Library molecules were hybridized either to oligonucleotide arrays or to probes in solution that were captured to beads, eluted and amplified by universal PCR. All samples were sequenced on a single lane of an Illumina flow cell in the same 36-cycle GA run. Image analysis, base calling, read alignment to the human reference sequence and SNP calling were performed with the Illumina analysis pipeline and with open source programs and scripts.

Capture quality

The capture specificity, i.e. the proportion of reads that aligns to the target region, determines the over-sampling required for sufficient coverage of a sequencing library. The average specificity was higher for the SureSelect captured libraries than for the Nimblegen libraries (Table 3), and thus approximately 1.4-fold more over-sampling is required when using the array method. More consistent results between the SureSelect libraries also indicated a better reproducibility.

Table 3. Average performance

              Probe covered   Specificity   Sensitivity1   Seq.    Coverage2   Norm.
              region (Mb)     (%)           (%)            depth   >10x (%)    coverage (%)
Nimblegen     2.4 (78%)       54            98.5 (76)      74      91 (93)     68
SureSelect    1.7 (55%)       74            99.8 (55)      157     98 (98)     68

1 Probe covered region, total target region in parenthesis
2 Probe covered region, region covered by both probe sets in parenthesis

We also determined the method sensitivity, or the completeness of coverage. The sensitivity is defined as the fraction of the target region which is captured and covered at least once (Table 3). When the probe-targeted regions were considered, the SureSelect libraries showed a higher sensitivity. The Nimblegen libraries, however, showed very similar levels of completeness for the region targeted by both probe sets, and covered more of the original target region. The mean sequencing depth was notably higher for the SureSelect libraries (Table 3) and the uniformity of coverage was slightly higher. The normalized coverage, which is independent of the sequencing depth, was however similar for both methods and even higher for the Nimblegen libraries. Of the targeted bases, 67-69% had at least half of the average coverage in the SureSelect libraries, and for the Nimblegen libraries the corresponding figure was 61-72%.

Variant calling

SNPs were called by comparison to the human reference sequence. When requiring a minimum coverage of ten reads, we were able to call between 2536 and 3371 SNPs per sample. Due to the larger probe-covered region, the number of called SNPs was higher in the libraries that were prepared using the Nimblegen method than in the ones prepared with the SureSelect method. The opposite was seen when only SNPs in the region targeted by both methods were taken into account, probably due to the higher sequencing coverage. We compared the called variants in our sequencing data to the genotypes from HapMap SNPs, annotated SNPs in the Ensembl variation database, and variants in the 1000 genomes project pilot I data release (April 2009), in the target region.
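The capture-quality measures reported in Table 3 can be made concrete with a small calculation. The sketch below is an illustration rather than the analysis pipeline used in Study IV: the function names and the toy depth values are hypothetical, the depth threshold is applied as "at least ten reads" (matching the minimum coverage used for SNP calling), and normalized coverage is taken as the fraction of target bases with at least half of the mean depth, as described in the text.

```python
def capture_metrics(depths, on_target_reads, total_reads, min_depth=10):
    """Summarize capture quality from per-base read depth over the target region.

    depths          -- read depth at each targeted base (hypothetical input)
    on_target_reads -- number of aligned reads that fall in the target region
    total_reads     -- number of aligned reads in the library
    """
    n_bases = len(depths)
    mean_depth = sum(depths) / n_bases
    return {
        # fraction of reads that align to the target region
        "specificity": on_target_reads / total_reads,
        # fraction of target bases covered at least once
        "sensitivity": sum(d >= 1 for d in depths) / n_bases,
        "mean_depth": mean_depth,
        # fraction of target bases with enough reads for SNP calling
        "fraction_at_min_depth": sum(d >= min_depth for d in depths) / n_bases,
        # fraction of target bases with at least half of the mean depth
        "normalized_coverage": sum(d >= 0.5 * mean_depth for d in depths) / n_bases,
    }


def oversampling_ratio(specificity_a, specificity_b):
    """How much more sequencing method B needs to match the on-target yield of A."""
    return specificity_a / specificity_b


if __name__ == "__main__":
    toy_depths = [0, 5, 10, 20, 40, 45]  # made-up values for six target bases
    print(capture_metrics(toy_depths, on_target_reads=740, total_reads=1000))
    # Average specificities from Table 3: SureSelect 74%, Nimblegen 54%.
    print(round(oversampling_ratio(0.74, 0.54), 2))
```

Because normalized coverage is expressed relative to the mean depth of each library, it allows libraries sequenced to different depths to be compared for uniformity.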
The called HapMap SNPs in the probe-specific regions and in the region covered by both methods followed the same pattern described above for the total number of identified variants. In total, 69-181 novel SNPs were identified per sample, of which 2 to 15 were annotated to exons. Two of the sequenced samples are part of the 1000 genomes cohort, which explains the low number of novel SNPs found in our study. Our identification of additional novel variation, however, emphasizes the need for high-coverage sequencing to disclose all sequence variation within a region of interest. For a number of SNP calls the position of the SNP agreed with the HapMap SNP data, but the base disagreed. Eleven positions that disagreed also after manual inspection were analysed with Sanger sequencing. In ten cases our GAIIx sequencing calls were validated, while in one case the Sanger sequencing result agreed with the HapMap data. A set of novel SNPs was also Sanger sequenced, which confirmed our SNP calls in eight cases, while two positions had been called erroneously in the Nimblegen libraries. We evaluated the inheritance pattern of novel SNPs and used the errors to estimate the false SNP discovery rate at 0.5% for the SureSelect library and 0.7% for the Nimblegen library for the sample NA10860. The combined results of Study IV demonstrate the feasibility of using hybridization-based sequence capture and sequencing on the GAIIx to reliably detect known SNPs and discover new SNPs in the targeted region.

Concluding remarks

Single base primer extension (SBE) is the most widely applied reaction principle in genotyping and sequencing systems. It is the basis of both the tag-array minisequencing system and the Genome Analyzer, which have been utilized in the studies included in this thesis. Tag-array minisequencing provides robust genotyping at intermediate multiplexing levels and sample throughput. We have shown its flexibility in Studies I-III by employing and adapting it to novel applications.
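For readers who want the cluster-quality measure in concrete form, the following sketch computes a Silhouette score following the general definition of silhouette widths. It is not the program from Publication I. For each sample, a is the mean distance to the other members of its own genotype cluster and b is the lowest mean distance to the members of any other cluster; the width is (b - a) / max(a, b), and the assay score is taken here as the mean width. The Euclidean distance on two-dimensional coordinates and the example points are assumptions made for illustration.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette width for points assigned to genotype clusters.

    points -- sequence of coordinate tuples (e.g. positions in a scatter plot)
    labels -- genotype call for each point, in the same order
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for single-member clusters
            continue
        # The point itself contributes distance 0, so divide by len(own) - 1.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


if __name__ == "__main__":
    calls = ["AA", "AA", "BB", "BB"]
    tight = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0)]
    loose = [(0.0, 0.0), (0.6, 0.0), (0.5, 0.0), (1.1, 0.0)]
    print(silhouette_score(tight, calls))  # well-separated clusters, close to 1
    print(silhouette_score(loose, calls))  # overlapping clusters, much lower
```

Scores close to 1 indicate compact, well-separated genotype clusters, while scores near or below zero indicate overlapping clusters or questionable genotype assignments.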
The software tool which we developed for calculating Silhouette scores in Study I can be used to advantage for objective validation of the quality of SNP genotyping assays. The calculation of Silhouette scores can be implemented in large-scale genotyping systems where automated quality scoring is desirable. A similar algorithm is indeed used for data from Affymetrix and in the Illumina Infinium assays. Moreover, Silhouette scores are convenient to use during assay development and optimization. As in Studies I and III, the measure can be applied to compare assay performance under different reaction conditions, or the results attained with different SNP genotyping technologies. In our research group we have used Silhouette scores to evaluate different methods for whole genome amplification (WGA), whose products are used as templates in the tag-array minisequencing system (Lovmar et al. 2003). The program that we created for calculating Silhouette scores is freely available. It can be used for quality assessment of the results from all genotyping systems in which the genotypes are assigned by cluster analysis using scatter plots. The same algorithm has been implemented as a function in the cluster package in the R programming environment (http://www.R-project.org) (R Development Core Team 2010), for large-scale data analysis.

With the complete set of mapping resources that we developed in Study II, we significantly reduced the time needed for large-scale gene mapping in Drosophila. The map of high-density sequence polymorphisms improved the resolution and increased the speed of genome-wide mapping and cloning of genes. Further, our tag-array minisequencing system in the special "array of arrays" format is particularly well suited for parallel genotyping of an intermediate number of samples and SNPs in a cost-effective, open source manner.
The automatic allele calling software that we created makes it easy to identify informative recombinants, which is usually the rate-limiting step in a mapping project, and to quickly extract the relevant recombinants from a large number of samples. The assays have been utilized to identify the transmembrane protein Kon-tiki (Kon), which mediates myotube target recognition in the Drosophila embryo during muscle development (Schnorrer et al. 2007).

Our successful transfer of tag-array minisequencing assays to micro-fabricated structures in Study III shows the potential to develop a completely integrated lab-on-a-chip for multiplex genotyping of cancer mutations. We found that freeze-drying of reagents and storage on the biochip is a good means of reducing the complexity of the multi-reaction DNA test. It is the first time that all reagents needed in the genotyping reactions, including the thermo-sensitive enzymes, have been freeze-dried and applied in a miniaturized assay. This achievement is critical also for other enzyme-assisted genotyping assays. The tag-array minisequencing system consists of sequential reactions without any separation steps, which reduces the complexity of the control hardware needed. Miniaturized assays have convenient characteristics for simple on-site DNA tests in healthcare, at crime scenes, at farms, or in environmental field studies. Biochips containing freeze-dried reagents are robust and promising for use outside of controlled lab environments. Important applications therefore include point-of-care testing in developing countries with limited infrastructure.

The Genome Analyzer system offers large-scale sequencing of short DNA fragments. The system can be used for whole genome sequencing or re-sequencing of regions of interest, as in Study IV of this thesis. Target enrichment techniques are powerful but increase the cost of sample preparation and the experiment time.
With increased sequencing throughput, reduced cost per sequenced base, and simplified data analysis, whole-genome sequencing will eventually overtake targeted sequencing for the analysis of individual samples. The analysis of sub-parts of the genome will, however, remain relevant for genotype analysis of candidate regions in large sample sets, and for clinical sequencing. The two capture methods which we evaluated in Study IV both detected known SNPs and discovered novel SNPs in continuous target regions. Our comparison demonstrated the feasibility of complexity reduction of sequencing libraries by hybridization-based capture methods. The strengths of the different methods should be taken into account in future development of enhanced methods for complexity reduction. As DNA sequences with lower complexity and variants within repetitive stretches can be biologically relevant, it will be essential to analyse all the variation in a given region. We therefore conclude that probe design that enables better coverage across entire target regions will be increasingly important.

Massively parallel sequencing, often based on sequencing-by-synthesis, will without doubt have a revolutionary impact on genomics and genetics, and fundamentally change our understanding of model organisms, and ultimately of ourselves. These technologies are promising for comprehensive scanning and screening of all types of genetic variation, or DNA modifications, and will most probably replace many techniques for genetic analysis used today, including genotyping techniques. Such variant analysis enables further characterisation of common complex disease and normal variation, and the identification of rare genetic variants and their effects on the phenotypic level in different populations or individuals.
The same techniques will also be important for clinical genetics and facilitate more "personalized medicine", and further transform population genetics, microbiology, evolutionary genetics and the characterization of ecological diversity (metagenomics).

The sequencing race continues! The X Prize Foundation (http://www.xprize.org) promotes the sequencing of 100 human genomes within ten days or less, at high accuracy, and at 10,000 USD per genome. The "third generation" sequencers are already advancing. An important feature is the direct analysis of single molecules in the full complexity of a whole human genome. Kilobase-pair long reads simplify assembly, and make it possible to interrogate repetitive regions and to reveal haplotype information. The sensitive reactions allow direct detection of base modifications during the sequencing process, and in situ sequencing in tissues.

Both basic research and translational efforts are required for biomedical progress. Therefore, I have enjoyed working with the development of new applications and tools, which have the potential to be used and adapted by many researchers. The tremendous development in molecular technology has made this a very exciting time for performing my research studies.

Acknowledgements

The work in this thesis was performed in the group for Molecular Medicine at the Department of Medical Sciences, Uppsala University. It was made possible through financial support from the Swedish Research Council for Science and Technology (VR-NT) and by grants from the EU Sixth Framework Program (FLYSNP and SMART-BioMEMS projects).

There are many people who have been involved in and contributed to this thesis and to whom I would like to express my gratitude. Some of you I would especially like to mention here.

My supervisor professor Ann-Christine Syvänen. For giving me the opportunity to be part of your lab and for supporting my Boston intermezzo. For all your encouragement and your solid biochemical knowledge.
My second supervisor Mats Nilsson. For your interesting work in the field of molecular technology development. It has inspired me.

This thesis is a result of many collaborations, partly due to our involvement in the EU projects. I would like to thank the partners in the lab of Barry Dickson in Vienna: Doris Chen, Frank Schnorrer and Irene Kalchhauser, as well as the team from Hungary. Further I would like to thank the whole group involved in the SMART-BioMEMS project and especially Monica Brivio, Anders Wolff, Henrik Brus and Massimo Romani. Thanks to these projects I have learned a lot beyond SNP genotyping.

The Church lab in Boston for the warm welcome and the inspiring environment during my research exchange.

All my research colleagues and co-workers. Previous members of the Molecular Medicine group. Lotta Rönn, Andreas Dahlgren and especially Lovisa Lovmar, Mats Jonsson, Lili Milani, Cecilia Oksanen and Snaevar Sigurdsson for the collaborations on the publications included in this thesis. A special thank you goes to Raul Figueroa for endless microarray spotting and practical help in the lab.

Everyone at the SNP genotyping and sequencing platform as well as Lillebil Andersson. For the help and the positive attitude towards everything. This goes also for all people at the Department of Medical Sciences. You really brighten up the facilities.

The current members in our research group. Anna Kiialainen, Olof Karlberg and Anders Lundmark for the collaboration on the publications. My fellow PhD students Per Lundmark, Chuan Wang and Jessica Nordlund as well as Sofia Nordman and Eva Berglund. You all contribute to the nice atmosphere in our group and to the great discussions and the good times we have had. Johanna Sandling for your help and for being a terrific colleague and friend throughout our joint PhD-student journey.

Last but not least I want to thank all my dear friends and my dearest family. For everything.
For all your interest over the years and your never-ending support, also during the excellent thesis retreat.

References

Adams, M. D., S. E. Celniker, et al. The genome sequence of Drosophila melanogaster. Science 287(5461): 2185-2195 (2000).
Ahmadian, A., B. Gharizadeh, et al. Genotyping by apyrase-mediated allele-specific extension. Nucleic Acids Res 29(24): E121 (2001).
Aitichou, M., S. Saleh, et al. Dual-probe real-time PCR assay for detection of variola or other orthopoxviruses with dried reagents. J Virol Methods 153(2): 190-195 (2008).
Albert, T. J., M. N. Molla, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 4(11): 903-905 (2007).
Alves, A. M. and F. J. Carr. Dot blot detection of point mutations with adjacently hybridising synthetic oligonucleotide probes. Nucleic Acids Res 16(17): 8723 (1988).
Ashburner, M. and C. M. Bergman. Drosophila melanogaster: a case study of a model genomic sequence and its consequences. Genome Res 15(12): 1661-1667 (2005).
Astbury, W. Nucleic acid. Symposia of the Society for Experimental Biology 1(66) (1947).
Astier, Y., O. Braha, et al. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5'-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 128(5): 1705-1710 (2006).
Avery, O. T., C. M. Macleod, et al. Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types: Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III. J Exp Med 79(2): 137-158 (1944).
Aziah, I., M. Ravichandran, et al. Amplification of ST50 gene using dry-reagent-based polymerase chain reaction for the detection of Salmonella typhi. Diagn Microbiol Infect Dis 59(4): 373-377 (2007).
Ball, M. P., J. B. Li, et al. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nat Biotechnol 27(4): 361-368 (2009).
Baner, J., A. Isaksson, et al. Parallel gene analysis with allele-specific padlock probes and tag microarrays. Nucleic Acids Res 31(17): e103 (2003).
Barrett, J. C., S. Hansoul, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet 40(8): 955-962 (2008).
Bell, P. A., S. Chaturvedi, et al. SNPstream UHT: ultra-high throughput SNP genotyping for pharmacogenomics and drug discovery. Biotechniques Suppl: 70-72, 74, 76-77 (2002).
Bennett, S. T. Solexa Ltd. Pharmacogenomics 5: 433-438 (2004).
Bennett, S. T., C. Barnes, et al. Toward the 1,000 dollars human genome. Pharmacogenomics 6(4): 373-382 (2005).
Bentley, D. R., S. Balasubramanian, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218): 53-59 (2008).
Berger, J., T. Suzuki, et al. Genetic mapping with SNP markers in Drosophila. Nat Genet 29(4): 475-481 (2001).
Bodmer, W. and C. Bonilla. Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 40(6): 695-701 (2008).
Braslavsky, I., B. Hebert, et al. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 100(7): 3960-3964 (2003).
Buetow, K. H., M. Edmonson, et al. High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc Natl Acad Sci U S A 98(2): 581-584 (2001).
Burdett, H. and M. van den Heuvel. Fruits and flies: a genomics perspective of an invertebrate model organism. Brief Funct Genomic Proteomic 3(3): 257-266 (2004).
Celniker, S. E. and G. M. Rubin. The Drosophila melanogaster genome. Annu Rev Genomics Hum Genet 4: 89-117 (2003).
Chee, M., R. Yang, et al. Accessing genetic information with high-density DNA arrays. Science 274(5287): 610-614 (1996).
Chen, D., A. Ahlford, et al.
High-resolution, high-throughput SNP mapping in Drosophila melanogaster. Nat Methods 5(4): 323-329 (2008).
Chen, D., J. Berger, et al. FLYSNPdb: a high-density SNP database of Drosophila melanogaster. Nucleic Acids Res 37(Database issue): D567-570 (2009).
Choi, M., U. I. Scholl, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A 106(45): 19096-19101 (2009).
Cirulli, E. T. and D. B. Goldstein. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11(6): 415-425 (2010).
Cohen, S. N., A. C. Chang, et al. Construction of biologically functional bacterial plasmids in vitro. Proc Natl Acad Sci U S A 70(11): 3240-3244 (1973).
Conrad, D. F., D. Pinto, et al. Origins and functional impact of copy number variation in the human genome. Nature 464(7289): 704-712 (2010).
Craddock, N., M. E. Hurles, et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464(7289): 713-720 (2010).
Crick, F. Central dogma of molecular biology. Nature 227(5258): 561-563 (1970).
Dahl, F., J. Baner, et al. Circle-to-circle amplification for precise and sensitive DNA analysis. Proc Natl Acad Sci U S A 101(13): 4548-4553 (2004).
Dahl, F., M. Gullberg, et al. Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments. Nucleic Acids Res 33(8): e71 (2005).
Dahl, F., J. Stenberg, et al. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc Natl Acad Sci U S A 104(22): 9387-9392 (2007).
Dean, F. B., S. Hosono, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 99(8): 5261-5266 (2002).
Dressman, D., H. Yan, et al. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci U S A 100(15): 8817-8822 (2003).
Drmanac, R., A. B. Sparks, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961): 78-81 (2010).
Drossman, H., J. A. Luckey, et al. High-speed separations of DNA sequencing reactions by capillary electrophoresis. Anal Chem 62(9): 900-903 (1990).
Dufva, M., J. Petersen, et al. Detection of mutations using microarrays of poly(C)10-poly(T)10 modified DNA probes immobilized on agarose films. Anal Biochem 352(2): 188-197 (2006).
Fan, J. B., X. Chen, et al. Parallel genotyping of human SNPs using generic high-density oligonucleotide tag arrays. Genome Res 10(6): 853-860 (2000).
Fan, J. B., A. Oliphant, et al. Highly parallel SNP genotyping. Cold Spring Harb Symp Quant Biol 68: 69-78 (2003).
Fire, A., S. Xu, et al. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391(6669): 806-811 (1998).
Focke, M., F. Stumpf, et al. Microstructuring of polymer films for sensitive genotyping by real-time PCR on a centrifugal microfluidic platform. Lab Chip (2010).
Fortini, M. E., M. P. Skupski, et al. A survey of human disease gene counterparts in the Drosophila genome. J Cell Biol 150(2): F23-30 (2000).
Franklin, R. E. and R. G. Gosling. Evidence for 2-Chain Helix in Crystalline Structure of Sodium Deoxyribonucleate. Nature 172(4369): 156-157 (1953).
Frazer, K. A., D. G. Ballinger, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449(7164): 851-861 (2007).
Fredriksson, S., J. Baner, et al. Multiplex amplification of all coding sequences within 10 cancer genes by Gene-Collector. Nucleic Acids Res 35(7): e47 (2007).
Fuller, C. W., L. R. Middendorf, et al. The challenges of sequencing by synthesis. Nat Biotechnol 27(11): 1013-1023 (2009).
Ge, D., J. Fellay, et al. Genetic variation in IL28B predicts hepatitis C treatment-induced viral clearance. Nature 461(7262): 399-401 (2009).
Gerry, N. P., N. E. Witowski, et al.
Universal DNA microarray method for multiplex detection of low abundance point mutations. J Mol Biol 292(2): 251-262 (1999). Gharizadeh, B., T. Nordstrom, et al. Long-read pyrosequencing using pure 2'-deoxyadenosine-5'-O'-(1-thiotriphosphate) Sp-isomer. Anal Biochem 301(1): 82-90 (2002). Gibbs, R. A., P. N. Nguyen, et al. Detection of single DNA base differences by competitive oligonucleotide priming. Nucleic Acids Res 17(7): 2437-2448 (1989). Gnirke, A., A. Melnikov, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27(2): 182-189 (2009). Gunderson, K. L., S. Kruglyak, et al. Decoding randomly ordered DNA arrays. Genome Res 14(5): 870-877 (2004). Hardenbol, P., J. Baner, et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21(6): 673-678 (2003). Hardenbol, P., F. Yu, et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 15(2): 269-275 (2005). Hinds, D. A., L. L. Stuve, et al. Whole-genome patterns of common DNA variation in three human populations. Science 307(5712): 1072-1079 (2005). Hirschhorn, J. N. and M. J. Daly. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6(2): 95-108 (2005). Hirschhorn, J. N., P. Sklar, et al. SBE-TAGS: an array-based method for efficient single-nucleotide polymorphism genotyping. Proc Natl Acad Sci U S A 97(22): 12164-12169 (2000). Hodges, E., Z. Xuan, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 39(12): 1522-1527 (2007). Holland, P. M., R. D. Abramson, et al. Detection of specific polymerase chain reaction product by utilizing the 5'→3' exonuclease activity of Thermus aquaticus DNA polymerase. Proc Natl Acad Sci U S A 88(16): 7276-7280 (1991). Horner, D. S., G. Pavesi, et al.
Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform 11(2): 181-197 (2010). Hoskins, R. A., A. C. Phan, et al. Single nucleotide polymorphism markers for genetic mapping in Drosophila melanogaster. Genome Res 11(6): 1100-1113 (2001). Howell, W. M., M. Jobs, et al. Dynamic allele-specific hybridization. A new method for scoring single nucleotide polymorphisms. Nat Biotechnol 17(1): 87-88 (1999). Hultin, E., M. Kaller, et al. Competitive enzymatic reaction to control allele-specific extensions. Nucleic Acids Res 33(5): e48 (2005). Jobs, M., W. M. Howell, et al. DASH-2: flexible, low-cost, and high-throughput SNP genotyping by dynamic allele-specific hybridization on membrane arrays. Genome Res 13(5): 916-924 (2003). Kaderali, L., A. Deshpande, et al. Primer-design for multiplexed genotyping. Nucleic Acids Res 31(6): 1796-1802 (2003). Karas, M. and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem 60(20): 2299-2301 (1988). Kennedy, G. C., H. Matsuzaki, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 21(10): 1233-1237 (2003). Kerem, B., J. M. Rommens, et al. Identification of the cystic fibrosis gene: genetic analysis. Science 245(4922): 1073-1080 (1989). Kim, S., M. E. Ulz, et al. Thirtyfold multiplex genotyping of the p53 gene using solid phase capturable dideoxynucleotides and mass spectrometry. Genomics 83(5): 924-931 (2004). Klatser, P. R., S. Kuijper, et al. Stabilized, freeze-dried PCR mix for detection of mycobacteria. J Clin Microbiol 36(6): 1798-1800 (1998). Korbel, J. O., A. E. Urban, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318(5849): 420-426 (2007). Kwok, P. Y. Methods for genotyping single nucleotide polymorphisms. Annu Rev Genomics Hum Genet 2: 235-258 (2001). Landegren, U., R. Kaiser, et al. A ligase-mediated gene detection technique.
Science 241(4869): 1077-1080 (1988). Lander, E. S., L. M. Linton, et al. Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921 (2001). Larsson, C., I. Grundberg, et al. In situ detection and genotyping of individual mRNA molecules. Nat Methods 7(5): 395-397 (2010). Leamon, J. H., W. L. Lee, et al. A massively parallel PicoTiterPlate based platform for discrete picoliter-scale polymerase chain reactions. Electrophoresis 24(21): 3769-3777 (2003). Lee, L. G., C. R. Connell, et al. Allelic discrimination by nick-translation PCR with fluorogenic probes. Nucleic Acids Res 21(16): 3761-3766 (1993). Levene, M. J., J. Korlach, et al. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299(5607): 682-686 (2003). Levene, P. The structure of yeast nucleic acid. The Journal of Biological Chemistry 40(2): 415-424 (1919). Levy, S., G. Sutton, et al. The diploid genome sequence of an individual human. PLoS Biol 5(10): e254 (2007). Li, J. B., Y. Gao, et al. Multiplex padlock targeted sequencing reveals human hypermutable CpG variations. Genome Res 19(9): 1606-1615 (2009). Li, J. B., E. Y. Levanon, et al. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324(5931): 1210-1213 (2009). Lindroos, K., S. Sigurdsson, et al. Multiplex SNP genotyping in pooled DNA samples by a four-colour microarray system. Nucleic Acids Res 30(14): e70 (2002). Liu, P. and R. A. Mathies. Integrated microfluidic systems for high-performance genetic analysis. Trends Biotechnol 27(10): 572-581 (2009). Liu, W. M., X. Di, et al. Algorithms for large-scale genotyping microarrays. Bioinformatics 19(18): 2397-2403 (2003). Livak, K. J., S. J. Flood, et al. Oligonucleotides with fluorescent dyes at opposite ends provide a quenched probe system useful for detecting PCR product and nucleic acid hybridization. PCR Methods Appl 4(6): 357-362 (1995). Lovmar, L., A. Ahlford, et al.
Silhouette scores for assessment of SNP genotype clusters. BMC Genomics 6(1): 35 (2005). Lovmar, L., M. Fredriksson, et al. Quantitative evaluation by minisequencing and microarrays reveals accurate multiplexed SNP genotyping of whole genome amplified DNA. Nucleic Acids Res 31(21): e129 (2003). Lovmar, L. and A. C. Syvänen. Multiple displacement amplification to create a long-lasting source of DNA for genetic studies. Hum Mutat 27(7): 603-614 (2006). Lyamichev, V., A. L. Mast, et al. Polymorphism identification and quantitative detection of genomic DNA by invasive cleavage of oligonucleotide probes. Nat Biotechnol 17(3): 292-296 (1999). Macdonald, S. J., T. Pastinen, et al. A low-cost open-source SNP genotyping platform for association mapping applications. Genome Biol 6(12): R105 (2005). Mamanova, L., A. J. Coffey, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 7(2): 111-118 (2010). Manolio, T. A., F. S. Collins, et al. Finding the missing heritability of complex diseases. Nature 461(7265): 747-753 (2009). Mardis, E. R. The impact of next-generation sequencing technology on genetics. Trends Genet 24(3): 133-141 (2008). Mardis, E. R. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387-402 (2008). Margulies, M., M. Egholm, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057): 376-380 (2005). Martin, S. G., K. C. Dobi, et al. A rapid method to map mutations in Drosophila. Genome Biol 2(9): RESEARCH0036 (2001). Matsuzaki, H., H. Loi, et al. Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res 14(3): 414-425 (2004). Mattick, J. S., R. J. Taft, et al. A global view of genomic information - moving beyond the gene and the master regulator. Trends Genet 26(1): 21-28 (2010). Maxam, A. M. and W. Gilbert. A new method for sequencing DNA. Proc Natl Acad Sci U S A 74(2): 560-564 (1977). Melamede, R. J. (1985).
Automatable process for sequencing nucleotide US, New York Medical College (Valhalla, NY) Metzker, M. L. Sequencing technologies - the next generation. Nature Reviews Genetics 11(1): 31-46 (2010). Michael, K. L., L. C. Taylor, et al. Randomly ordered addressable high-density optical sensor arrays. Anal Chem 70(7): 1242-1248 (1998). Milani, L., M. Gupta, et al. Allelic imbalance in gene expression as a guide to cis-acting regulatory single nucleotide polymorphisms in cancer cells. Nucl. Acids Res.: gkl1152, doi:10.1093/nar/gkl1152 (2007). Mitani, Y., A. Lezhava, et al. Rapid SNP diagnostics using asymmetric isothermal amplification and a new mismatch-suppression technology. Nat Methods 4(3): 257-262 (2007). Mitra, R. D. and G. M. Church. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27(24): e34 (1999). Moretti, T. R., A. L. Baumstark, et al. Validation of short tandem repeats (STRs) for forensic usage: performance testing of fluorescent multiplex STR systems and analysis of authentic and simulated forensic samples. J Forensic Sci 46(3): 647-660 (2001). Moriyama, E. N. and J. R. Powell. Intraspecific nuclear DNA variation in Drosophila. Mol Biol Evol 13(1): 261-277 (1996). Mullis, K., F. Faloona, et al. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harb Symp Quant Biol 51(Pt 1): 263-273 (1986). Need, A. C., D. Ge, et al. A genome-wide investigation of SNPs and CNVs in schizophrenia. PLoS Genet 5(2): e1000373 (2009). Neely, G. G., K. Kuba, et al. A global in vivo Drosophila RNAi screen identifies NOT3 as a conserved regulator of heart function. Cell 141(1): 142-153 (2010). Nejentsev, S., N. Walker, et al. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324(5925): 387-389 (2009). Nilsson, M., H. Malmgren, et al. Padlock probes: circularizing oligonucleotides for localized DNA detection.
Science 265(5181): 2085-2088 (1994). Nyren, P., B. Pettersson, et al. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal Biochem 208(1): 171-175 (1993). Ohnishi, Y., T. Tanaka, et al. A high-throughput SNP typing system for genome-wide association studies. J Hum Genet 46(8): 471-477 (2001). Ohno, K., K. Tachikawa, et al. Microfluidics: applications for analytical purposes in chemistry and biochemistry. Electrophoresis 29(22): 4443-4453 (2008). Okou, D. T., K. M. Steinberg, et al. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 4(11): 907-909 (2007). Oliphant, A., D. L. Barker, et al. BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques Suppl: 56-58, 60-61 (2002). Olivier, M. The Invader assay for SNP genotyping. Mutat Res 573(1-2): 103-110 (2005). Pal-Bhadra, M., U. Bhadra, et al. Cosuppression in Drosophila: gene silencing of Alcohol dehydrogenase by white-Adh transgenes is Polycomb dependent. Cell 90(3): 479-490 (1997). Pang, A. W., J. R. MacDonald, et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol 11(5): R52 (2010). Parks, A. L., K. R. Cook, et al. Systematic generation of high-resolution deletion coverage of the Drosophila melanogaster genome. Nat Genet 36(3): 288-292 (2004). Pastinen, T., A. Kurg, et al. Minisequencing: a specific tool for DNA analysis and diagnostics on oligonucleotide arrays. Genome Res 7(6): 606-614 (1997). Pastinen, T., J. Partanen, et al. Multiplex, fluorescent, solid-phase minisequencing for efficient screening of DNA sequence variation. Clin Chem 42(9): 1391-1397 (1996). Pastinen, T., M. Raitio, et al. A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res 10(7): 1031-1042 (2000). Pease, A. C., D. Solas, et al. Light-generated oligonucleotide arrays for rapid DNA sequence analysis.
Proc Natl Acad Sci U S A 91(11): 5022-5026 (1994). Petitjean, A., M. I. Achatz, et al. TP53 mutations in human cancers: functional selection and impact on cancer prognosis and outcomes. Oncogene 26(15): 2157-2165 (2007). Petitjean, A., E. Mathe, et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum Mutat 28(6): 622-629 (2007). Pop, M. and S. L. Salzberg. Bioinformatics challenges of new sequencing technology. Trends Genet 24(3): 142-149 (2008). Porreca, G. J., K. Zhang, et al. Multiplex amplification of large sets of human exons. Nat Methods 4(11): 931-936 (2007). Prober, J. M., G. L. Trainor, et al. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238(4825): 336-341 (1987). R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. ISBN 3-900051-07-0 (2010). Redon, R., S. Ishikawa, et al. Global variation in copy number in the human genome. Nature 444(7118): 444-454 (2006). Reiter, L. T., L. Potocki, et al. A systematic analysis of human disease-associated gene sequences in Drosophila melanogaster. Genome Res 11(6): 1114-1125 (2001). Riordan, J. R., J. M. Rommens, et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245(4922): 1066-1073 (1989). Ronaghi, M., S. Karamohamed, et al. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242(1): 84-89 (1996). Ronaghi, M., M. Uhlen, et al. A sequencing method based on real-time pyrophosphate. Science 281(5375): 363, 365 (1998). Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65 (1987). Ryder, E., F. Blows, et al.
The DrosDel collection: a set of P-element insertions for generating custom chromosomal aberrations in Drosophila melanogaster. Genetics 167(2): 797-813 (2004). Sachidanandam, R., D. Weissman, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409(6822): 928-933 (2001). Sanger, F., S. Nicklen, et al. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74(12): 5463-5467 (1977). Schmidtke, J., B. Pape, et al. Restriction fragment length polymorphisms at the human parathyroid hormone gene locus. Hum Genet 67(4): 428-431 (1984). Schnorrer, F., I. Kalchhauser, et al. The transmembrane protein Kon-tiki couples to Dgrip to mediate myotube targeting in Drosophila. Dev Cell 12(5): 751-766 (2007). Sebat, J., B. Lakshmi, et al. Large-scale copy number polymorphism in the human genome. Science 305(5683): 525-528 (2004). Shendure, J. and H. Ji. Next-generation DNA sequencing. Nat Biotechnol 26(10): 1135-1145 (2008). Shendure, J., G. J. Porreca, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309(5741): 1728-1732 (2005). Sherry, S. T., M. H. Ward, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1): 308-311 (2001). Shumaker, J. M., A. Metspalu, et al. Mutation detection by solid phase primer extension. Hum Mutat 7(4): 346-354 (1996). Smith, L. M., J. Z. Sanders, et al. Fluorescence detection in automated DNA sequence analysis. Nature 321(6071): 674-679 (1986). Southern, E. M., U. Maskos, et al. Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: evaluation using experimental models. Genomics 13(4): 1008-1017 (1992). Spradling, A. C., D. Stern, et al. The Berkeley Drosophila Genome Project gene disruption project: Single P-element insertions mutating 25% of vital Drosophila genes. Genetics 153(1): 135-177 (1999). St Johnston, D. The art and design of genetic screens: Drosophila melanogaster.
Nat Rev Genet 3(3): 176-188 (2002). Stankiewicz, P. and J. R. Lupski. Structural variation in the human genome and its role in disease. Annu Rev Med 61: 437-455 (2010). Steemers, F. J. and K. L. Gunderson. Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2(1): 41-49 (2007). Summerer, D. Enabling technologies of genomic-scale sequence enrichment for targeted high-throughput sequencing. Genomics 94(6): 363-368 (2009). Summerer, D., D. Hevroni, et al. A flexible and fully integrated system for amplification, detection and genotyping of genomic DNA targets based on microfluidic oligonucleotide arrays. N Biotechnol 27(2): 149-155 (2010). Summerer, D., N. Schracke, et al. Targeted high throughput sequencing of a cancer-related exome subset by specific sequence capture with a fully automated microarray platform. Genomics 95(4): 241-246 (2010). Syvänen, A. C. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet 2(12): 930-942 (2001). Syvänen, A. C. Toward genome-wide SNP genotyping. Nat Genet 37 Suppl: S5-10 (2005). Syvänen, A. C., K. Aalto-Setala, et al. A primer-guided nucleotide incorporation assay in the genotyping of apolipoprotein E. Genomics 8(4): 684-692 (1990). Syvänen, A. C., A. Sajantila, et al. Identification of individuals by analysis of biallelic DNA markers, using PCR and solid-phase minisequencing. Am J Hum Genet 52(1): 46-59 (1993). Tabor, S. and C. C. Richardson. A single residue in DNA polymerases of the Escherichia coli DNA polymerase I family is critical for distinguishing between deoxy- and dideoxyribonucleotides. Proc Natl Acad Sci U S A 92(14): 6339-6343 (1995). Tam, G. W., R. Redon, et al. The role of DNA copy number variation in schizophrenia. Biol Psychiatry 66(11): 1005-1012 (2009). Teague, B., M. S. Waterman, et al. High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci U S A 107(24): 10848-10853 (2010). Tewhey, R., J. B. Warner, et al.
Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 27(11): 1025-1031 (2009). The International HapMap Consortium. A haplotype map of the human genome. Nature 437(7063): 1299-1320 (2005). Tjio, J. and A. Levan. The chromosome number of man. Hereditas 42: 1-6 (1956). Toth, G., Z. Gaspari, et al. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10(7): 967-981 (2000). Turner, E. H., S. B. Ng, et al. Methods for genomic partitioning. Annu Rev Genomics Hum Genet 10: 263-284 (2009). Tweedie, S., M. Ashburner, et al. FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res 37(Database issue): D555-559 (2009). Tyagi, S., D. P. Bratu, et al. Multicolor molecular beacons for allele discrimination. Nat Biotechnol 16(1): 49-53 (1998). Tyagi, S. and F. R. Kramer. Molecular beacons: probes that fluoresce upon hybridization. Nat Biotechnol 14(3): 303-308 (1996). Vallone, P. M. and J. M. Butler. AutoDimer: a screening tool for primer-dimer and hairpin structures. Biotechniques 37(2): 226-231 (2004). Venken, K. J. and H. J. Bellen. Emerging technologies for gene manipulation in Drosophila melanogaster. Nat Rev Genet 6(3): 167-178 (2005). Venter, J. C., M. D. Adams, et al. The sequence of the human genome. Science 291(5507): 1304-1351 (2001). Vos, P., R. Hogers, et al. AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 23(21): 4407-4414 (1995). Wallace, R. B., J. Shaffer, et al. Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Res 6(11): 3543-3557 (1979). Wang, J., W. Wang, et al. The diploid genome sequence of an Asian individual. Nature 456(7218): 60-65 (2008). Watson, J. D. and F. H. Crick. A structure for deoxyribose nucleic acid. Nature 171(4361): 964-967 (1953). Weigl, B., G. Domingo, et al. Towards non- and minimally instrumented, microfluidics-based diagnostic devices.
Lab Chip 8(12): 1999-2014 (2008). Werner, J. H., H. Cai, et al. Progress towards single-molecule DNA sequencing: a one color demonstration. J Biotechnol 102(1): 1-14 (2003). Wheeler, D. A., M. Srinivasan, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189): 872-876 (2008). White, R. A., 3rd, P. C. Blainey, et al. Digital PCR provides sensitive and absolute calibration for high throughput sequencing. BMC Genomics 10: 116 (2009). Whitesides, G. M. The origins and the future of microfluidics. Nature 442(7101): 368-373 (2006). Wu, D. Y., L. Ugozzoli, et al. Allele-specific enzymatic amplification of beta-globin genomic DNA for diagnosis of sickle cell anemia. Proc Natl Acad Sci U S A 86(8): 2757-2760 (1989).