Applications of Four-Colour Fluorescent Primer Extension

To my family
List of Publications
This thesis is based on the following publications, which are referred to in
the text by their Roman numerals.
I
Lovmar L, Ahlford A, Jonsson M, Syvänen A-C. “Silhouette”
scores for assessment of SNP genotype clusters. BMC Genomics. (2005) 6:35.
II (a) Chen D*, Ahlford A*, Schnorrer F*, Kalchhauser I, Fellner
M, Viràgh E, Kiss I, Syvänen A-C, Dickson BJ. Highresolution, high-throughput SNP mapping in Drosophila
melanogaster. Nature Methods. (2008) 5(4),323-9.
(b) Schnorrer F*, Ahlford A*, Chen D*, Milani L, Syvänen AC. Positional cloning by fast-track SNP-mapping in Drosophila melanogaster. Nature Protocols. (2008) 3(11),1751-65.
III Ahlford A, Kjeldsen B, Reimers J, Lundmark A, Wolff A, Romani M, Syvänen A-C, Brivio M. Dried reagents for multiplex
genotyping by tag-array minisequencing to be used in microfluidic devices. Analyst. (2010) Jul 29. (Published online ahead
of print)
IV Kiialainen A, Karlberg O, Ahlford A, Sigurdsson S, LindbladToh K, Syvänen A-C. Performance of microarray and liquid
based capture methods for target enrichment for massively
parallel sequencing and SNP discovery. Submitted manuscript.
* These authors contributed equally to this work
Reprints were made with permission from the respective publishers.
Additional publication by the author
Li J.B, Gao Y, Aach J, Zhang K, Kryukov GV, Xie B, Ahlford
A, Yoon JK, Rosenbaum AM, Zaranek AW, LeProust E, Sunyaev SR, Church GM. Multiplex padlock targeted sequencing
reveals human hypermutable CpG variations. Genome Research. (2009) 19(9),1606-15.
Supervisors
Prof. Ann-Christine Syvänen, Department of Medical Sciences, Uppsala
University, Uppsala, Sweden
Associate Prof. Mats Nilsson, Department of Genetics and Pathology,
Uppsala University, Uppsala, Sweden
Chair
Prof. Eva Tiensuu Janson, Department of Medical Sciences, Uppsala
University, Uppsala, Sweden
Faculty opponent
Prof. David Schwartz, Department of Chemistry, Laboratory of Genetics,
University of Wisconsin-Madison, WI, USA
Review board
Prof. Pål Nyrén, School of Biotechnology, Albanova, KTH (Royal Institute of Technology), Stockholm, Sweden
Associate Prof. Mattias Mannervik, Wenner-Gren Institute, Developmental Biology, Stockholm University, Stockholm, Sweden
Prof. Lars Engstrand, Swedish Institute for Infectious Disease Control and
Department of Microbiology, Cell and Tumor Biology, Karolinska Institutet, Stockholm, Sweden
Contents
Introduction ................................................................................................... 13
The chemical structure of DNA ............................................................... 13
Sequence variation ................................................................................... 14
Analysis of sequence variation ................................................................. 15
Polymorphisms as genetic markers ..................................................... 15
Gene mapping in model organisms ..................................................... 17
Resources for sequence analysis .............................................................. 19
Methods for variant detection................................................................... 20
Complexity reduction .......................................................................... 21
Reaction principles .............................................................................. 25
Assay formats ...................................................................................... 35
Coding and decoding ........................................................................... 37
Data analysis and quality control ......................................................... 38
Aim of this thesis .......................................................................................... 39
The present study .......................................................................................... 40
Tag-array minisequencing ........................................................................ 40
Reaction steps and assay set-up ........................................................... 40
Silhouette scores for quality assessment of genotype data ....................... 43
Silhouette scores .................................................................................. 44
Enzyme performance ........................................................................... 45
Tools for gene mapping in Drosophila melanogaster.............................. 47
SNP map and FLYSNPdb ................................................................... 48
Genome-wide genotyping assays for gene mapping ........................... 48
SNPmapper for simplified data analysis.............................................. 49
Resources for fine-mapping ................................................................. 50
Mapping of genes underlying muscle development ............................ 50
Tag-array minisequencing in a lab-on-a-chip........................................... 51
Freeze-drying of reagents .................................................................... 51
Systematic genotyping comparison ..................................................... 52
Genotyping in micro-fabricated structures .......................................... 54
Sample complexity reduction for re-sequencing candidate genes............ 56
Study design ........................................................................................ 56
Capture quality .................................................................................... 57
Variant calling ..................................................................................... 57
Concluding remarks ...................................................................................... 59
Acknowledgements ....................................................................................... 62
References ..................................................................................................... 64
Abbreviations
bp
CNV
dNTP
ddNTP
ExoI
GA
InDel
kb
Mb
MIP
MPS
nt
PCR
RFLP
SAP
SBE
SBS
SNP
µ-TAS
Base pair
Copy number variant
Deoxynucleotide triphosphates
Dideoxynucleotide triphosphates
Exonuclease I
Genome Analyzer
Insertion-deletion polymorphism
Kilobase pair
Megabase pair
Molecular inversion probe
Massively parallel sequencing
Nucleotide
Polymerase chain reaction
Restriction fragment length polymorphisms
Shrimp alkaline phosphatase
Single base primer extension
Sequencing by synthesis
Single nucleotide polymorphism
Micro-total analysis systems
Introduction
The advances in the field of biomedical research have the last decades been
driven by the rapid development of technology. Dideoxy “Sanger” sequencing
(Sanger et al. 1977) for the determination of the base order of DNA molecules, together with polymerase chain reaction (PCR) (Mullis et al. 1986) for
targeted amplification of DNA, are the two most important examples. These
tools have revolutionized molecular biology and made it possible to determine
the complete sequence of the human genome, of which the first draft celebrates its 10 year anniversary this year (Lander et al. 2001; Venter et al. 2001).
The trend has continued through the “post-genomic” era with the development
of large-scale genotyping techniques for the study of genetic sequence variation (Syvänen 2005). More recently, principles for massively parallel sequencing (MPS) have followed (Mardis 2008; Shendure and Ji 2008; Metzker
2010). These tools enable extensive genetic research. In addition, hopes are
high that the technology may have an impact on clinical genetics and its transition into healthcare. This thesis addresses some of the features related to the
analysis of sequence variation in the genetic model organism Drosophila
melanogaster and in humans.
The chemical structure of DNA
The deoxyribonucleic acid (DNA) molecule contains the code for the inherited
information of all living organisms (Avery et al. 1944). Commonly, DNA
consists of two long polymers made up of four different types of units called
nucleotides. Each nucleotide is composed of the sugar molecule deoxyribose,
a phosphate group and one of the four nitrogen bases, adenine, thymine, cytosine and guanine (Levene 1919; Astbury 1947). The nucleotides are ordered in
a linear sequence in a predetermined way. The two molecules are complementary to each other and twined together to form a double helix where the bases
pair in a specific manner (Franklin and Gosling 1953; Watson and Crick 1953)
. In each cell, these large molecules are organized into structures called chromosomes. In the fruit-fly Drosophila melanogaster the genetic information is
encoded in ~180 million base pairs (bp) (Celniker and Rubin 2003) packed
into four pairs of chromosomes: the X/Y sex chromosomes and the autosomes
2, 3 and 4. In humans the complete genetic code, the genome, is about 20
times bigger (Lander et al. 2001) and organized in 23 pairs of chromosomes
13
(Tjio and Levan 1956). The portions of DNA that contains the genetic information are called genes. These are read and transcribed from DNA to RNA,
which in turn can be translated into amino acids to form proteins with different functions, according to the central dogma of molecular biology stated by
Francis Crick (Crick 1970). Surprisingly, the number of protein coding genes
in more complex organisms, like humans, is only slightly higher than in the
fruit fly. It is therefore suggested that the complexity lies in the plasticity and
regulation of expression of the human genome. Many studies have shown that
the large fraction of non-protein coding DNA sequences have important biological functions (Mattick et al. 2010).
The comparisons of DNA sequences from a great number of different organisms have revealed the common features of life. Fundamental genetic principles and functions are amazingly conserved throughout evolution considering
the wide spectrum of organisms on earth. This commonality makes it possible
to use model organisms for understanding principles that can be applied to
more complex systems. Drosophila has been one of the most studied organisms in biological research already for almost a century, particularly in genetics and developmental biology (Burdett and van den Heuvel 2004). The fruit
fly serves as an important research model and has in many aspects contributed
to knowledge important for human health such as the understanding of human
development and basic functions. About half of the protein coding genes in
Drosophila have human homologues and 75% of known human disease genes
have a homolog in the genetic code of the fruit fly (Fortini et al. 2000; Reiter
et al. 2001).
Sequence variation
The increasing amount of genetic sequence information from multiple individuals has facilitated the study of sequence variation both between and within
species. It is estimated that the haploid genome of two human beings on average differ by 0.5-1% (Levy et al. 2007; Pang et al. 2010). Together with environmental factors, these inherited signatures contribute to the phenotype of
each individual and are part of what makes each of us unique.
All variation in the genome has occurred due to mutation events in the DNA
sequence, either caused by copying errors during cell division or by the exposure to environmental factors such as toxic compounds, radiation or viruses.
Variations that arise in the germline are passed on to the offspring while somatic mutations only affect the individual organisms. Genetic sequence variations range from large chromosomal alterations to substitutions of individual
nucleotides. The most abundant form is single nucleotide polymorphisms
(SNPs), which are germline mutations that are fixed in the population. They
14
occur approximately every 1,200 bp in the human genome (Sachidanandam et
al. 2001) and every 200 bp in the Drosophila genome (Moriyama and Powell
1996). The complete sequenced individual diploid genomes show that each
human carries around 3 million SNPs compared to the human genome reference sequence, which constitute 0.1% of the total sequence (Levy et al. 2007;
Bentley et al. 2008; Wang et al. 2008; Wheeler et al. 2008). In addition to the
single nucleotide variations, approximately half of the human genome consists
of repetitive nucleotide sequences (Lander et al. 2001), (Sachidanandam et al.
2001) (Korbel et al. 2007). Examples of structural alterations are large copy
number variants (CNVs), deletions or insertions, and small (1-5 bp) insertiondeletion polymorphisms (InDels). Structural variation in total constitutes a
bigger fraction of nucleotides than SNPs (Conrad et al. 2010; Pang et al. 2010)
and recent studies suggest that non-SNP variation, and in particular small
CNVs also arise more frequently than previously expected (Stankiewicz and
Lupski 2010). Short tandem repeats (STRs) or microsatellites consist of repeated units, 2-6 nucleotides in length (Toth et al. 2000). They are often
highly polymorphic but less abundant in the genome. SNPs that create or disrupt a restriction enzyme cleavage site are called restriction fragment length
polymorphisms (RFLPs) (Schmidtke et al. 1984). Alternatively, a repeated
sequence or insertions, deletions, translocations and inversions, can also lead
to RFLPs. In the studies included in this thesis we have focused on the detection and analysis of small sequence variants including SNPs and InDels.
Analysis of sequence variation
The impact of sequence variation is dependent on its genomic location. Variants occurring in the coding sequence may alter amino acids of proteins and
changes in intragenic regions and in introns can affect the stability of transcripts and the regulation of gene expression. The variants may be neutral,
lead to an altered phenotype, or give rise to disease.
Polymorphisms as genetic markers
Sequence variants serve as markers in genetic studies. Variant analysis for
indentifying disease genes commonly relies on one of two different approaches; linkage studies in family pedigrees, and candidate gene or genomewide associations studies (GWAS) on sets of affected and unaffected control
individuals. In linkage studies the co-segregation of a characteristic with a
marker is traced based on their physical proximity on a chromosome with
limited recombination between them. Chromosomal segments that are shared
between affected individuals in a family are identified. The altered cleavage
pattern caused by RFLPs were the first type of markers to be used in linkage
analysis and enabled the positional cloning of the cystic fibrosis gene in 1989
15
(Kerem et al. 1989; Riordan et al. 1989). The disease is an effect of a 3-bp
deletion (delF508) in the CFTR gene, resulting in a non-functional protein
product. RFLPs were later replaced by microsatellites which are multi-allelic
and more informative as genetic markers. Both RFLPs and microsatellites are
however difficult to analyze in a high-throughput way and recently SNPs have
gained importance in linkage studies. The bi-allelic nature of SNPs makes
them less informative but technically easy to analyze at a large scale in an
automated way. SNPs also provide high resolution due to their high density in
the genome. Linkage studies have been very successful in identifying highly
penetrant rare alleles with monogenic patterns. Most diseases that are inherited
in a Mendelian way have now been identified and many are analyzed in patients in routine diagnostic laboratories today
Many types of genetic variants are predisposing factors in common multifactorial disorders. Recently SNPs have been widely applied as markers in association studies which rely on linkage disequilibrium (LD) or the non-random
association between the loci influencing the disease and the tested marker.
GWAS were initiated to identify the high frequency genetic variants underlying common disease in an unbiased way. These studies have resulted in the
detection of genes for many complex disorders such as Crohn’s disease, where
more than 30 loci have been identified (Barrett et al. 2008). In addition, these
studies have identified novel cellular pathways important for the pathogenesis
of different diseases which can pinpoint the disease mechanisms and suggest
new drug targets. Despite some success, common diseases have turned out to
be complicated to map and the functional variants difficult to identify. The
reason for this is that they are caused by multiple genetic factors with low
effect sizes in combination with environmental features. The causal alleles
may also be rare and differ between individuals or among populations. Different approaches are currently discussed and tested in medical genetics to further elucidate the genetic component of complex disease (Bodmer and Bonilla
2008; Manolio et al. 2009; Cirulli and Goldstein 2010). Re-sequencing of
interesting regions or whole genomes, of patients and individuals from the
general population using MPS technology, is one promising means. The new
sequencing technologies and “paired-end” analysis will also play an important
role for precisely mapping and exploring more complex sequence variants in
relation to disease and other phenotypes (Stankiewicz and Lupski 2010).
Complex variants often lead to severe pathogenic phenotypes, and large
chromosomal aberrations are for example commonly found in tumour cells.
Many studies have shown that CNVs give rise to different mental disorders
like schizophrenia (Need et al. 2009; Tam et al. 2009), but that they are frequent also in healthy individuals (Sebat et al. 2004; Korbel et al. 2007).
Additionally SNPs are important as markers in population genetics, evolutionary studies, and in pharmacogenetics (Ge et al. 2009). RFLPs and microsatel16
lites have also been frequently used as markers for gene mapping and in forensic investigations (Moretti et al. 2001).
Gene mapping in model organisms
Forward genetic screens are performed by selecting individuals who possess a
phenotype of interest and subsequently identifying the unknown underlying
gene (Figure 1) (St Johnston 2002). In model organisms such as Drosophila
melanogaster, mutations can be induced in individuals by exposure to a
mutagen, such as a chemical or radiation. Mutants that harbour a desired phenotype are then isolated. Other mutagens, such as random DNA insertions by
transformation or active transposons, can also be used to generate new mutants. An advantage with these techniques is that the new alleles are tagged by
a known molecular marker that can facilitate the identification of the gene.
The mutated gene is next located on its chromosome by first crossbreeding the
mutants with individuals carrying other traits with known locations, and finally by estimating how frequently the traits and the mutant phenotype are
inherited together in recombinant individuals. Large-scale genetic screens in
the multi-cellular model organism Drosophila are powerful for gaining
knowledge about cellular and developmental processes. The possibility of
performing clonal screens enables the discovery of mutations underlying the
affect of a given process, and the identification of phenotypes in practically
any cell, or stage of development.
Figure 1. In gene mapping the line carrying an induced mutation (indicated by a star)
is crossed to a divergent mapping line to generate recombinants. The phenotype of the
recombinants are scored as mutant of wildtype, and by combining this information
with the SNP genotype the recombination breakpoints are defined and the location of
the mutation is identified.
17
The identification of the mutated gene is usually the rate limiting step in genetic screens. In classical screens phenotypic traits are used to map the new
mutant alleles. Traditional mapping strategies in Drosophila are recombination mapping, deficiency mapping (Parks et al. 2004) or mapping with the
help of P-elements (Spradling et al. 1999; Ryder et al. 2004), which mainly
rely on genetic and cytological markers that are not easily linked to the molecular map. Moreover, phenotypic traits are not very frequent, or evenly
spread throughout the chromosomes. Large gene regions are “cold-spots” for
P-element insertions which result in a bias for mutagenesis of certain loci.
These mapping traits also suffer from variability in the genetic background.
With an increasing amount of genomic sequence information available, mutations in model organisms are frequently mapped by using small molecular
polymorphisms as mapping traits. Examples of such are microsatellites, InDels and most commonly SNPs. In contrast to the classical mapping techniques mentioned above, SNP mapping offers many advantages including
phenotypic neutrality, defined genetic background, high resolution and even
distribution in the DNA sequence between common laboratory strains, and the
feasibility to analyze all mutational classes (Martin et al. 2001). In several
model organisms, mapping of genes with the help of polymorphisms has thus
proven to be the preferred method. SNP mapping in Drosophila has relied on
traditional genotyping methods like analysis of RFLPs, PCR product-length
polymorphism (PLP) assays, and denaturing high performance liquid chromatography (DHPLC) (Hoskins et al. 2001) (Berger et al. 2001). These methods
are well established and easy to perform with standard equipment. They are
however hampered by the low throughput, and the genotyping procedure has
therefore been a bottle-neck in mapping studies. Recently there has been a
enormous technology development in the genotyping field, and large-scale
systems are available for parallel genotyping of large SNP sets at a low cost
per genotype (Syvänen 2005), described later on in this thesis. The current
systems are however available mainly for fixed human applications and often
involve huge investments of specialized instrumentation. As stated by Macdonald S. et al (Macdonald et al. 2005) it is difficult to identify an available
system for efficient, cost-effective genotyping at an intermediate scale outside
of human genetics, and there is a need for open-source and off-the-shelf solutions to fill this gap. In Study II we developed a cheap microarray-based genotyping system to facilitate SNP mapping in Drosophila.
In reverse genetic screens, mutations are instead generated in known genes or
at known genome positions, and the phenotypic consequence is subsequently
studied. With the genome sequence available it has become possible to silence
genes using RNA interference (RNAi) in model organisms such as Caenorhabditis elegans (Fire et al. 1998) and Drosophila melanogaster (Pal-Bhadra
et al. 1997; Neely et al. 2010).
18
Resources for sequence analysis
The advances in molecular biology and the large genome projects have resulted in publicly available resources, including sequence databases and
analysis tools, for a great number of organisms. Human and Drosophila are
among the organism with the most comprehensive tools available. The National Center for Biotechnology Information (NCBI) was initiated by the National Institutes of Health (NIH) already in 1988, and is today the largest resource for biomedical and genomic information (http://www.ncbi.nlm.
nih.gov/). The Ensembl project is a joint effort between EMBL-EBI and the
Wellcome Trust Sanger Institute since 2000 to establish genome databases for
vertebrates and other eukaryotic species including automatic sequence annotations (http://www.ensembl.org). A similar resource is provided by the University of California Santa Cruz (UCSC) (http://genome.ucsc.edu).
One of the databases associated with the NCBI is the Single Nucleotide Polymorphism database, dbSNP (http://www. ncbi.nlm.nih.gov/SNP/) (Sherry et
al. 2001). It serves as a central repository of SNPs and short InDels, currently
for 58 different organisms. The majority of the polymorphisms are human and
build 131 (April 2010) comprised 23.7 million variants of which 14.7 million
have been validated. There has been a tremendous increase in the number of
variants reported in the last years, primarily due to the increased amount of
human sequence data generated by MPS projects. By the effort of the international Haplotype mapping (HapMap) project, a catalogue of common human
variation (>1%) in individuals from different populations has been established
(The International HapMap Consortium 2005; Frazer et al. 2007). The HapMap phase III data was released at the end of May 2010 and in total the resource today comprises data for 1301 individuals from 11 populations
(http://hapmap.ncbi.nlm.nih.gov). The goal of the HapMap project has been to
identify the chromosomal regions where genetic variants are shared in the
different populations, the common haplotype blocks, and to identify so called
"tag" SNPs that uniquely identify each of these blocks. The 1000 genomes
project was initiated in 2008 with the aim of sequencing the complete genomes and gene regions of a large number of individuals using the new MPS
technology (http://www.1000genomes.org). The goal is to provide a comprehensive resource on human genetic variation, and to discover and characterize
rare single nucleotide alleles that are present at minor allele frequencies
around 1% or less.
FlyBase is an extensive bioinformatics database for the genome of the model
organism Drosophila melanogaster and related species (http://flybase.org). It
includes the highest-quality genome sequences and annotations available for
any organism (Ashburner and Bergman 2005; Tweedie et al. 2009). Despite
the big toolbox available for researchers in the Drosophila community
19
(Venken and Bellen 2005) it has until recently lacked a large-scale catalogue
of small molecular polymorphisms, similar to the human dbSNP. Study IIa of
this thesis includes an effort to indentify sequence variants in common Drosophila strains and to establish a high resolution variant database (Chen et al.
2008; Chen et al. 2009).
Methods for variant detection
The large-scale genomic efforts mentioned above have provoked a rapid development in technology for the analysis of sequence variation, primarily
SNPs. Today there are many genotyping systems available, both commercial
and open source, for scoring known SNPs in the whole complexity range from
single markers to genome wide analysis. Figure 2 compares different genotyping methods in terms of sample and marker throughput. Recently there have
been significant efforts to establish large-scale genotyping systems where key
aspects have been cost efficiency per genotype, automation, and parallelisation
of the biochemical reactions and the assay format. As a result, commercially
systems for genetic analysis on a very large scale, and based on microarray
technology, are now available in human genetics (Syvänen 2005). The largest
assay for highly multiplexed SNP genotyping, provided by Illumina (Steemers
and Gunderson 2007), allows parallel scoring of 2.5 million SNPs, to be increased to 5 million before long (www.illumina.com).
Nevertheless, the genotyping systems available today still suffer some limitations. With the growing numbers of SNPs identified it is desirable to further
lower the cost per genotype to facilitate wide-spread large-scale genotyping.
Additionally, the chips should be extended to comprise not only common
SNPs but also rare variants. The largest, genome-wide genotyping approaches
are primarily developed for fixed panels of markers to identify disease predisposing SNPs that are associated to complex diseases in humans (Hirschhorn
and Daly 2005) and require significant investments in equipment with high
reagent costs. Moreover, the systems are inflexible for adaptation to specialized and novel applications. Therefore, cost-efficient genotyping systems on
an intermediate scale are needed. These will be required for replication of
GWAS findings, for follow-ups of sequencing studies, as well as for future
genetic analysis and diagnostics of previously identified risk variants. Moreover, genotyping in non human systems has a need for good solutions and
better intermediate scale systems in an open-source manner. Although microarray-based methods have been successfully adapted for the analysis of
CNVs (Redon et al. 2006; Craddock et al. 2010), large-scale analysis of more
complex structural variants needs to be addressed further by new approaches
(Teague et al. 2010).
20
The existing genotyping platforms all combine different reaction principles for
allele discrimination with a number of assay formats and strategies for labelling and for read-out of individual signals, in a modular fashion (Kwok 2001;
Syvänen 2001). Below follows a description of these modules and examples
of technology.
Figure 2. Comparison of SNP multiplexing levels and sample throughput in common
SNP genotyping systems. In the Axiome system Affymetrix has increased the sample
throughput of their hybridization-based method.
Complexity reduction
Direct analysis of sequence variants in the complexity of the genome of higher
organisms is technically challenging. To achieve a sufficient sensitivity and
specificity, most techniques for genotyping or sequencing require either amplification of the starting material for further analysis, and/or amplification of
the read-out signal after processing. Despite the recent improvements in MPS
technology, it is costly and time consuming to sequence the whole genome of
complex organisms in a large number of individuals (Summerer 2009; Turner
et al. 2009; Mamanova et al. 2010). Additionally, the amount of data generated and the data handling is overwhelming for most laboratories today.
Hence, in many MPS studies it is desirable to reduce the complexity of the
21
template to perform targeted analysis on portions of the genome in multiple
individuals, and utilize the sequencing capacity of MPS machines in a more
optimal way.
Polymerase chain reaction
Two commonly used approaches for amplification of specific genomic fragments are cloning (Cohen et al. 1973) or amplification by polymerase chain
reaction (PCR) (Mullis et al. 1986). PCR is powerful for increasing sensitivity
and specificity for downstream assays. With PCR a locus-specific region of
the genome is amplified, which increases the number of template copies and
reduces the complexity of the DNA to be analysed. The principle requires very
low amounts of input material. For over a decade it has been a golden standard
for copying of DNA. Recently, long range PCR has been adapted for targeting
smaller regions for MPS (1-100 kb) (Mamanova et al. 2010), and applied to
sequence candidate genes in pools of a larger number of individuals
(Nejentsev et al. 2009). The increasing amounts of genes, SNPs, and sequences to be analyzed, require procedures for parallel amplification. However, two problems with multiplex PCR are uneven amplification efficiency
due to sequence-dependent differences of the fragments, and an exponentially
increase of amplification artefacts as the number of added primer-pairs grow.
Presently, multiplex PCR reactions are a bottleneck in highly multiplex genotyping methods, since careful design and optimization of PCR assays at multiplexing levels exceeding 10-20 amplicons is difficult and time consuming,
even with better design software. One way to circumvent the difficulties with
multiplex PCR is to physically separate individual DNA molecules before
clonally amplifying them. PCR colonies, or “polonies” (Mitra and Church
1999) enable parallel amplification of DNA by separating PCR reactions in a
thin polyacrylamide film, and in emulsion PCR (Dressman et al. 2003) clonal
amplification of fragments from single DNA molecules is performed in a water-in-oil emulsion. Recently, micro-droplet based PCR was applied for the
parallel amplification of 457 fragments (172 kb total genomic size) (Tewhey
et al. 2009) for subsequent MPS. It is currently limited to ~4000 primer pairs
and a throughput of 8 samples per day. Additionally it requires dedicated instrumentation and as much as 7.5µg of input material.
One example of an alternative PCR-strategy is rolling circle amplification
(RCA) and circle-to-circle amplification, presented by Dahl et al (Dahl et al.
2004), where a circularized target molecule is repeatedly amplified to generate
a long linear molecule containing tandem repeats of the target sequence. Reduction of genome complexity has also proven successful by the amplified
fragment length polymorphism (AFLP) approach. In AFLP, genomic DNA is
cleaved by restriction digestion followed by PCR-amplification with universal
linker-primers and fragment size selection. This approach has been used for
22
SNP genotyping (Vos et al. 1995) and the same principle is employed in the
Affymetrix GeneChip system (Kennedy et al. 2003; Matsuzaki et al. 2004).
A reversed approach is taken by methods which are based on initial allele
detection directly in genomic DNA, followed by PCR with universal primer
pairs for signal amplification (Oliphant et al. 2002; Baner et al. 2003). Illumina applies this solution in their GoldenGate assay for multiplexed genotyping in large-scale projects (Fan et al. 2003).
For undirected DNA amplification, a number of whole genome amplification
(WGA) strategies in different forms are available (Lovmar and Syvänen
2006). These are powerful for increasing the total amount of DNA and to secure the source material. Multiple displacement amplification (MDA) is the
most commonly used approach, which is based on an isothermal and branchlike amplification using the DNA polymerase of the bacteriophage phi29 and
random primers (Dean et al. 2002). Illumina has in their Infinium II assay
implemented an initial step of MDA followed by allele discrimination and
finally signal magnification with an antibody sandwich strategy (Steemers and
Gunderson 2007).
Capture by circularization
Enzymatic reactions and oligonucleotides are employed for target capture by
circularization. Padlock probes/molecular inversion probes (MIPs) (Nilsson et
al. 1994) are linear single stranded oligonucleotides consisting of two target
specific sequences flanking a universal linker for amplification. They require a
highly specific dual recognition event when hybridized to their target molecule, which enables a filling of the “gap” and a ligation to generate a circular
library molecule. MIPs have been applied for large-scale target capture at Mb
scale and MPS for; analysis of 10,000 human exons (Porreca et al. 2007),
identification of genetic variation in hypermutable CpG regions (Li et al.
2009) methylation profiling (Ball et al. 2009) or to screen predicted RNA editing sites (Li et al. 2009), using array-released oligonucleotide libraries. The
principle eliminates the need for additional shot gun library preparation. Since
it is solution-based it is easily scaled up and automated using standard molecular laboratory equipment. The main limitations are the capture uniformity and
the upfront cost of probes for large target regions and the difficulty to obtain
large amounts of each probe species. Double stranded Collector probes
(Fredriksson et al. 2007) and Selector probes (Dahl et al. 2005; Dahl et al.
2007) follow a similar reaction principle for target capture. These reactions
include an initial multiplex PCR or restriction digestion after which the probes
are used to guide the circularization of the target molecules. They have been
applied for sequencing the coding sequence of 10 human cancer genes (Dahl
et al. 2007; Fredriksson et al. 2007) but currently only enables intermediate
scale enrichment.
23
Capture by hybridization
Targeted pull-down reactions based on hybridization of target molecules to
oligonucleotide microarrays have been performed for enriching ~5Mb exonic
regions per reaction for MPS (Albert et al. 2007; Hodges et al. 2007; Okou et
al. 2007). This approach, provided by Roche/Nimblegen (http://www.nim
blegen.com), requires an initial library preparation step which is followed by
hybridization to target specific oligonucleotide arrays containing 385,000
probes (Figure 3A) and recovery and amplification of the captured targets.
Recently the principle has been further scaled up to facilitate the capture of up
to 34Mb on a single array and applied for targeting the whole human exome
for genetic diagnosis of Bartter syndrome (Choi et al. 2009). It is available for
exon capture, and can be designed for custom and continuous target regions.
The ability to capture large regions is a strength while the array-based format
requires expensive hardware and gets laborious for larger number of samples.
Additionally, it requires rather high amount of starting material, regardless of
the total size of the target region. A similar approach in a fully automated microfluidic format has also been used to capture a 480 kb exome subset of 115
cancer-related genes (Summerer et al. 2010). This format is still only adaptable for targets up to 1Mb in size but has the advantage of the integrated microfluidic format, promising for scaling up and for the adaption to clinical
applications.
Gnirke et al. presented a hybridization-based capture method in solution for
MPS using long (170-mer) RNA probes to capture >15,000 coding exons and
four continuous regions, 4.2 Mb in total size (Gnirke et al. 2009) (Figure 3B).
Today Agilent Technologies (http://www.home.agilent.com) are providing the
SureSelect target capture method in solution for MPS both for exon sequencing and for customized capture. The method involves the same reaction steps
as the array capture but overcomes many of the obstacles related to the array
format. The reactions are easily parallelized and automated for increased sample throughput and are performed using standard molecular biology instrumentation. Compared to the array format the hybridization is driven further to
completion by an access of probe molecules per target molecule and the reaction time is decreased. The probe cost decreases with the number of samples
analysed, making the system suitable for large-scale analysis. For smaller
studies the cost per capturing experiment is however favourable with the
Nimblegen arrays.
24
Figure 3. In the first step a sequencing library is generated through DNA fragmentation, end-repair and universal adapter ligation. (A) The Nimblegen sequence capture
method uses probes that are synthesized on microarray slides, with a varying probe
length (60-90nt) to yield a uniform melting temperature. Library molecules are hybridized to 385K feature arrays in a custom incubation station for 65 hours, after
which unbound fragments are washed away and the enriched portion is eluted. (B)
The SureSelect method uses biotinylated RNA capture probes, 120 nucleotides in
length, to which the library molecules are hybridized in solution for 24 hours in a
regular thermocycler. The reaction is mixed with magnetic, streptavidin-coated beads
to capture probe bound library molecules, followed by washing and elution. In both
methods, the capture products are finally amplified by universal PCR by means of the
adapter sequences.
Reaction principles
Most biochemical reaction principles used to determine the identity of a given
nucleotide in a DNA sequence, are based on short oligonucleotide probes or
primers. The methods rely either on allele-specific hybridization or enzymeassisted allele discrimination. Some of these principles are discussed below
and shown in Figure 4.
25
Allele-specific oligonucleotide hybridization (ASH)
In many approaches the hybridization between complementary DNA strands
is employed for allele discrimination. Two allele-specific oligonucleotides are
designed to differ in the variable base position and hybridized to the target
DNA under stringent conditions (Wallace et al. 1979). The difference in thermal stability between the perfectly matched and the mis-matched probe is
used to distinguish between SNP alleles. Amplified target DNA can be hybridized to high-density oligonucleotide arrays containing allele-specific primers
for high-throughput genotyping (Pease et al. 1994; Chee et al. 1996). Perlegen
Sciences (closed operations in 2009) developed a system where long-range
PCR and array hybridization was combined for multiplex genotyping of
~50,000 SNPs (Hinds et al. 2005), which among other things was used in the
HapMap project. The GeneChip system by Affymetrix allows genotyping of
up to 1.8 million SNP and CNP markers by array hybridization. The hybridization characteristics of the probes are largely influenced by the sequence
surrounding the SNP and the reaction conditions. This requires careful SNP
selection and up to 40 different oligonucleotides per SNP to obtain high specificity. In dynamic allele-specific hybridization (DASH) (Howell et al. 1999)
the template-probe duplex stability is instead monitored during a temperature
increase. DASH has only been multiplexed for two SNPs per reaction and
used in small genotyping projects (Jobs et al. 2003). Hybridization-based approaches are simple and do not require expensive enzymes.
Allele specific probes for real-time detection
DNA amplification during PCR can be monitored directly in an allele-specific
way using double-labelled, allele-specific oligonucleotide probes. The 5’exonuclease or TaqManTM assay uses specialized probes and the 5’-3’ nucleolytic activity of Taq DNA polymerase for allele discrimination (Holland et
al. 1991; Lee et al. 1993; Livak et al. 1995). The probe contains a fluorescent
reporter label in its 5’-end and a quencher in its 3’end. The signal from the
reporter is suppressed when the two molecules are in proximity of each other.
During the extension step of PCR the probe is displaced and cleaved by the
polymerase, whereby the fluorophore is released and a detectable fluorescence
signal is generated. This assay has been frequently used for single marker
genotyping and serves as standard method. Molecular beacons are another
type of double-labelled probes that form a hairpin structure in solution leading
to quenching of the signal (Tyagi and Kramer 1996; Tyagi et al. 1998). Hybridization of the probe to a PCR product breaks the hairpin shape and separates the probe ends which gives rise to a signal.
Allele-specific primer extension (ASE)
The strategy is based on the ability of a DNA polymerase to only extend allele-specific primers with a perfectly matched 3’-end (Gibbs et al. 1989; Wu et
26
al. 1989), either directly on the target sequence or on already amplified samples. The discrimination between matched and mis-matched probes by the
polymerase is however not perfect which leads to low specificity. One approach to increase the selectivity of the reaction is to include the enzyme
apyrase. It inactivates the nucleotides and competes with the primer extension
action of the DNA polymerase (Ahmadian et al. 2001). A beneficial feature of
the principle is a cost reduction due to the use of un-labelled probes, although
it still involves some visualization of the genotyping results. Allele-specific
extension has been applied for interrogating low numbers of variants, as in
Illumina’s custom genotyping system using cylindrical glass microbeads (Veracode beads), and in specialized assay formats (Mitani et al. 2007).
Single base primer extension (SBE)
The incorporation of a single nucleotide in an allele-specific manner is called
single base primer extension (SBE) or minisequencing (Syvänen et al. 1990).
One detection primer that anneals directly adjacent to the SNP position in the
target molecule is used per reaction. The detection primer is extended with a
nucleotide by a DNA polymerase, guided by the complementary template
strand. The accuracy of the enzyme governs the high specificity of the reaction. The extension reaction is independent of the nature of the variable nucleotide and the sequence context surrounding it, which allows most SNPs to
be analysed at the same reaction conditions. Furthermore, the high specificity
makes the reaction principle suitable for multiplexing. In an early comparison,
a microarray-based minisequencing assay showed approximately one order of
magnitude better genotype discrimination than allele-specific hybridization in
the same assay format (Pastinen et al. 1997). The robust underlying principle
of minisequencing is the basis for its successful integration into numerous
assays at different levels of complexity, in various reaction formats and with
different labelling strategies, for the analysis of nucleic acids. It is the basis of
the SNPStream system from Beckman Coulter (Bell et al. 2002) and the Infinium II platform provided by Illumina (Steemers and Gunderson 2007). Further
the minisequencing principle has also shown excellent performance in quantitative applications, especially with tritium-labelled nucleotides, or when using
mass spectrometric detection (Syvänen et al. 1993; Buetow et al. 2001; Milani
et al. 2007). Minisequencing is the underlying reaction principle in the tagarray based genotyping system which was used in Studies I-III of this thesis.
This system is discussed in more detail in the present study section.
27
Figure 4. Hybridization-based and enzymatic principles for allele discrimination. The
reporter and quencher of the dual-labelled probes are denoted R and Q, respectively.
28
Oligonucleotide ligation (OLA)
As for the polymerase-mediated extension reaction, the joining of two perfectly matched DNA strands by DNA ligase is a highly specific event. DNA
ligase has hence been used to discriminate between two allele-specific probes
that are hybridized at a SNP site in a PCR product. When there is a perfect
match, the ligase seals the nick between one of two allele-specific probes and
a third, locus specific probe binding upstream of the SNP site. This principle,
denoted oligonucleotide ligation assay (OLA) was first described in 1988
(Alves and Carr 1988; Landegren et al. 1988) and is applied in the SNPlexTM
system from Life Technologies (Applied Biosystems, http://www.lifetech
nologies.com) for 48-plex SNP genotyping. The concept was adapted in the
development of the padlock probe (Nilsson et al. 1994). The padlock probe is
a linear oligonucleotide which contains the sequence from both OLA probes
in either end, joined together with a linker sequence. The highly specific reaction requires two adjacent recognition events followed by an enzymatic ligation to form a circular molecule. The circle is amplified by rolling circle amplification or by universal PCR by sequences in the linker before signal readout. Padlock probes have recently been used for single molecule detection in
situ (Larsson et al. 2010). Single base extension was combined with ligation of
molecular inversion probes for high throughput SNP genotyping of 12,000
markers (Hardenbol et al. 2003; Hardenbol et al. 2005), commercially offered
by ParAllele Biosciences, now part of Affymetrix. The polymerase extends
the “gap” that is created at the SNP position before sealing by ligation. This
approach requires only one probe per queried SNP but separate extension reactions with one type of nucleotide present in each. Un-circularized probes are
digested by exonuclease following amplification of the circles and signal read
out on arrays.
In the GoldenGate assay, developed by Illumina (Fan et al. 2003), the enzymatic discrimination is combined with hybridization for a highly specific
genotyping assay for up to 1,536 SNP. Hybridization of allele-specific primers
directly to genomic DNA is combined with an extension reaction and a subsequent ligation to a locus-specific probe. All oligonucleotides contain sequences for universal PCR amplification of the generated template with labelled primers, and the locus-specific probe carries a barcode for capturing of
the ligated probe complex to bead-arrays, Veracode beads, or bead-chips.
Invasive cleavage
In the invasive cleavage method, the Invader assay, a structure-specific flap
endonuclease (FEN) is used to cleave a complex formed by two allele-specific
overlapping oligonucleotides that are hybridized to a target DNA containing a
polymorphism (Lyamichev et al. 1999). The three-dimensional structure triggers the cleavage of the oligonucleotide and the difference in cleavage rate
29
between complementary and non-complementary probes is the basis for allele
discrimination. The Invader assay principle has been developed for a variety
of assay formats and detection strategies (Olivier 2005). Its advantages include high accuracy, the isothermal reaction which does not require thermal
cycling and the possibility to perform genotyping directly on genomic DNA.
Despite the improvements, the method uses fairly large amounts of input DNA
and was only successfully multiplexed for the analysis of ~100 SNPs (Ohnishi
et al. 2001).
Sequencing
DNA sequencing is nowadays the most commonly used method to identify
specific nucleic acid sequences. It has played a great role in the identification
of novel sequence variants but is also used in small scale genotyping studies
and for validation. Sequencing technology was pioneered in the mid seventies
by Maxam and Gilbert through their chemical sequencing method (Maxam
and Gilbert 1977). At the same time, the less cumbersome chain-terminating
method, based on incorporation of labelled terminating dideoxynucleotide
triphosphates (ddNTPs) and electrophoretic separation of DNA fragments of
varying length, was published by Sanger et al (Sanger et al. 1977). The
method uses transfected cloning vectors or PCR products as template for sequencing. In the early versions the sequencing primers were radioactively
labelled and only one out of four terminating nucleotides was present in each
reaction in addition to unlabelled deoxynucleotide triphosphates (dNTPs). The
development of this method has been remarkable over the years. Today
Sanger sequencing is performed in a single reaction with all four, fluorescently labelled nucleotides present (Smith et al. 1986; Prober et al. 1987).
Other improvements include the development of whole-genome shotgun sequencing and separation of the sequencing products on automated highthroughput capillary DNA sequencers (Drossman et al. 1990). The sequence
of Drosophila melanogaster was published in 2000 as the first application of
the whole-genome shotgun approach to sequencing of an animal genome
(Adams et al. 2000).With sequencers such as the ABI3730xl DNA Analyzer
from Life Technology, up to 1000 bases can be read in 96 or 384 samples per
run. The technique has been crucial in the large genome projects but has also
played a major role as a standard method for many genetic research approaches.
Pyrosequencing is an alternative principle for sequencing (Ronaghi et al.
1996; Ronaghi et al. 1998). Different systems applying the principle are today
commercially available through the company Qiagen (http://www.pyrose
quencing.com). Its underlying reaction principle is sequencing by synthesis
(SBS) (Melamede 1985), in which SBE has been further developed to determine the sequential nucleotide incorporation during strand synthesis by the
DNA polymerase. In pyrosequencing the nucleotides are resolved through an
30
enzymatic cascade resulting in a detectable signal. Nucleotide incorporation
initiates a release of pyrophosphate (PPi) which together with adenosine phosphosulphate (APS) act as substrates for ATP generation by ATP sulfurylase.
The energy in ATP is converted to detectable light by luciferin and the signal
is directly proportional to the number of incorporated nucleotides in a nucleotide flow. Each nucleotide cycle contains one of four dNTPs and between
dispensions the unincorporated nucleotides and the remaining ATPs are degraded by apyrase. The degradation is crucial to avoid washing procedures in
between cycles and decrease accumulation of light signals that lead to high
background levels at longer read-length. In an effort to increase the readlengths, up to 100 bases could be determined (Gharizadeh et al. 2002). Reliable quantification of the signal is a problem when reading homopolymeric
sequences. The usage of multiple enzymes makes the assay rather expensive.
Pyrosequencing has been used in small scale genotyping projects and is frequently used for clinical applications.
Massively parallel sequencing
Despite the tremendous improvements, Sanger sequencing has reached its
limits with respect to cost and parallelization (Shendure et al. 2005), and is too
expensive for whole genome sequencing of large genomes. To stimulate innovation in the field, a sequencing race was initiated by the J. Craig Venter Science Foundation in 2003, to award the first complex (human or corresponding
complexity) whole-genome sequenced for less than 1000 USD. In 2004 this
strive was further encouraged through a large grant program for sequence
technology development by the NIH. The rationale was that 1000 USD would
be a reasonable cost to make multiplex large-scale sequencing a standard
method. Among numerous promising applications it would enable comprehensive analysis of the mutational spectrum of an organism. The efforts have
indeed led to the development of several emerging sequencing technologies
and new innovative solutions with the ability to process millions of sequence
reads in parallel; so called next or second generation sequencing technology
(Mardis 2008; Shendure and Ji 2008; Metzker 2010).
Ensemble-based methods involve multiple clonally amplified molecules. The
SBS strategy described above using DNA polymerases and ligases, has been
implemented for parallel analysis of DNA strands (Fuller et al. 2009). Nucleotides or short oligonucleotides are used for determining the base type. The
established methods all employ synchronized sequencing cycles in an iterative
fashion by adding a single kind of nucleotide at a time or by adding nucleotide
substrates that are reversibly blocked. Although the need for vector-based
cloning strategies is eliminated, all of the methods require an initial sequencing library preparation step. This involves fragmentation of the molecules and
incorporation of general priming motifs, i.e. adapter sequences, to both ends
of a fragment for amplification and sequencing. The platforms can determine
31
either one end, or the paired-ends of a given molecule. They also allow the
inclusion of barcodes for sample indexing and pooled sequencing approaches.
454 sequencing, today commercially provided by Roche (http://www.
roche.com), was the first MSP technique described, published in 2005
(Margulies et al. 2005). It is based on emulsion PCR, where library molecules
are clonally amplified en masse on agarose beads, and the pyrosequencing
chemistry for nucleotide discrimination conducted in a picotiter plate format
(Leamon et al. 2003). Currently the Genome Sequencer FLX system generates
more than one million high-quality reads per each 10 hour instrument run with
a read length of about 500 bases. One drawback with the technique is the
problem of reading homopolymers, which is an effect of the pyrosequencing
reaction. The sequential flow of nucleotides however diminishes the occurrence of substitution errors in the sequence. The long reads simplifies the
alignment and makes the system ideally suited for de novo sequencing of
whole genomes and transcriptomes, including partially non-unique sequences.
The second MPS system was developed by the company Solexa and introduced on the market in 2006 (Bennett 2004; Bennett et al. 2005) (Figure 5).
Today the Solexa technology is commercially available in the Genome Analyzer (GA) platform from Illumina. Library molecules are bound to oligonucleotide arrays in a flowcell and amplified by solid phase bridge amplification
to generate millions of clonal template clusters. Each round of SBS involves
the addition of four differently labelled and reversibly terminating nucleotides
and four-color imaging of the incorporated base. The 3’-terminus of the incorporated base is de-blocked and the fluorophore is removed before the next
sequencing cycle is initiated. Each flow cell can hold eight separate reactions
or samples. The current throughput of the GAIIx system is 15-20 million
paired-end reads per sample of ~2x100 bp length, adding up to a total of 20
Gbp generated in each ten day machine run. In 2010, the company announced
the launch of an updated version of the system (HiSeq2000) that will generate
approximately ten times more data in each run, mostly due to better usage of
their flow cells which increases the surface area. The advantage with the
Solexa technology is the large amount of data generated at a fairly low price,
while the relatively short read-length still is an obstacle for some applications
and made it initially suitable for re-sequencing applications. With the constant
increase in read-length it has also been applied for whole human genome sequencing (Bentley et al. 2008) and it is the technology used by the 1000 genomes project.
The mutiplex polony sequencing by ligation method (Shendure et al. 2005),
uses mate-pair library molecules that are constructed by circularization of
template molecules, RCA, restriction cleavage and adapter ligation. The linear
library molecules contain two genomic target sequence tags of 17-18 bp, each
32
separated and flanked by universal sequences. The fragments are captured to
beads and amplified in an emulsion PCR step which is followed by immobilization in an acrylamide gel to form a monolayered array of beads. The sequencing strategy utilizes an anchor primer that hybridizes directly 5’ or 3’ of
one of the genomic sequence tags. Differently labelled degenerate nonamers
are added to the reaction, followed by a selective ligation between the anchor
primer and the complementary nonamer. The base identity at the query site in
the nonamer correlates with its label. After signal read-out the primer-probe
complexes are stripped away and the next cycle begins with the addition of a
new population of nonamers with a different query position. This permits sequencing with high specificity of 13 bp (6+7 from each end) in each genomic
tag and a total of 26 determined nucleotides per amplicon. Proof of principle
was shown by sequencing the genome of an Escherichia coli. It has been further developed in the Polonator G.007 (http://www.polonator.org), an open
platform provided by Dover Systems with open-source software and protocols
and off-the-shelf reagents at a very competitive price.
Figure 5. Schematic drawing of the Genome Analyzer system using Solexa technology for MPS. (A) Sequence library molecules are via their universal adapter sequences randomly hybridized to tags in one of eight lane of a microfluoidic flowcell.
This is followed by bridge amplification on the surface to form millions of clonal
molecule clusters (B). (C) The sequence of the molecules is deduced by SBS with four
differently labeled, reversibly terminating nucleotides present. Each incorporation of a
complementary base is followed by washing and (D) four colour imaging.
33
The sequencing by oligonucleotide ligation principle has also been applied in
the SOLiDTM system from Life Technologies. Library molecules are captured
to oligonucleotides on magnetic beads, amplified by emulsion PCR and attached to glass slides. The sequencing process starts with the annealing of an
adapter specific sequencing primer and a set of four fluorescently semidegenerate labelled di-base probes. A probe match results in a specific ligation
event and a fluorescent read-out where the label corresponds to the first two
positions in the probe. The last two bases of each probe including the label are
removed to enable the next sequencing cycle. Multiple cycles of ligation, detection, and cleavage are performed before the extension product is removed,
and the template interrogated in a second round of ligation cycles with a
primer complementary to the n-1 position. By this reset approach each base is
interrogated twice which enables a read accuracy check. Two drawbacks with
the technology are the short read length and the somewhat complex data output format in colour space. The current version of the instrument (SOLiDTM4)
can generate up to 100 Gb of mappable sequence or about 1.4 billion reads of
50 bp length per ~11 day run. With the SOLiD™ 4hq System upgrade this
scales to 300 Gb of sequence in 2.4 billion reads of up to 75bp in length per
run.
The three major players described above have all introduced smaller systems
during 2010 to better suit the needs of individual laboratories or for clinical
applications. These are convenient for targeted re-sequencing, pathogen detection and de novo sequencing of small genomes. General features include high
speed, easy workflows and simplified data analysis. In addition to the systems
mentioned above the company Complete Genomics (http://www.complete
genomics.com) offers whole-human-genome service sequencing with a technique that resembles the principles employed in Polony sequencing (Drmanac
et al. 2010).
Many of the problems related to ensemble-based sequencing methods can be
overcome with single molecule sequencing, also referred to as third generation
sequencing. These include the difficulty of generating long reads because of
phasing problems of signals from populations of molecules, the need for template amplification, and the fairly large consumption of sequencing reagents
and template material for library preparation. The company Helicos Biosciences (http://www.helicosbio.com) developed an amplification-free method
based on sequential SBS of single molecules attached to glass slides and signal readout using fluorescence microscopy (Braslavsky et al. 2003). The average read-length today is 35 bases. Pacific Biosciences and Life Technology
both have real-time techniques using a “free running” DNA polymerase for
single molecule SBS on their way. Pacific Biosciences (http://www.pacific
biosciences.com) are currently launching their SMRTTM method where DNA
polymerase molecules are attached to zero-mode waveguides (Levene et al.
34
2003). After the addition of single molecule templates the incorporation of
labelled nucleotides by each individual DNA polymerase molecule is detected.
The waveguide enables specific detection of incorporated labels among a
background of thousands of labelled nucleotides. The DNA polymerase incorporates 1-3 bases per second in the initial commercial release and generates a
long DNA chain in minutes. Life Technologies single molecule sequencing
system employs a similar but reversed approach. Here the single molecule
templates are attached to a solid support to which individual DNA polymerase
molecules are captured for sequence determination. This enables addition of
new enzymes when the polymerase activity decreases. The incorporation of
nucleotides is detected by using quantum dots and FRET chemistry. Ion Torrent use semiconductor technology for determining the identity of natural nucleotides that are sequentially incorporated by a DNA polymerase
(http://www.iontorrent.com). When a base is added to a DNA stand a hydrogen ion is released. The ion charge changes the pH of the solution which can
be detected by a sensor. Currently this technique is not at the single molecule
sensitivity level.
Exonuclease–mediated cleavage of labelled nucleotides can also be utilized
for sequencing (Werner et al. 2003). The same principle has been combined
with nanopores which can recognize single unmodified deoyribonucleoside
monophosphates (dNMPs) (Astier et al. 2006). The pores act as base sensors
when a dNMP interacts with the binding site and an adapter in the pore, resulting in a measurable change in ion current.
Assay formats
The most common format for conducting biochemical reactions has been in
solution in test tubes. Homogeneous assays, where all reactions take place in
the same test tube, require less laboratory handling but are limited in the multiplexing level of the reactions and are less sensitive due to lack of separation
possibilities. This format is among others applied in TaqMan assays (Holland
et al. 1991). Different solid supports have however gained importance due to
the possibility of parallelisation of samples and reactions. Solid-phase assay
formats enable immobilization of targets by capturing to oligonucleotides or to
chemically activated surfaces, which is convenient for multi-step reactions
that require multiple separation steps. The microarray format, first presented
in 1992 (Southern et al. 1992), consists of a miniaturized solid-support commonly made of glass. It facilitates parallel and miniaturized analysis with reduced costs and usage of DNA. Since the arrayed format was introduced a
number of supports and strategies have been developed for target organisation.
These supports include microtiter plates, micro-beads in solution or assembled
on microwells of fibre-optic bundles (Michael et al. 1998) or silica slides
(Gunderson et al. 2004), gel films (Dufva et al. 2006), and various types of
35
micro-structures (Summerer et al. 2010). Apart from wide usage in RNAexpression analysis, microarrays are commonly used in DNA analysis and are
gaining importance for analysis of proteins, antibodies and tissue. Microarrays
have been essential in the development of large-scale SNP genotyping systems. Oligonucleotide arrays can be produced by in situ synthesis of the molecules on the surface, and Affymetrix use a photolithographic process for their
on-chip synthesis. More commonly, oligonucleotides are immobilized to activated surfaces by different printing techniques. The early microarrays for
genotyping contained immobilized SNP-specific probes (Shumaker et al.
1996; Pastinen et al. 1997; Pastinen et al. 2000). “Tags”, on the contrary, are
barcode sequences complementary to sequence motifs included in the detection primers of enzyme-assisted genotyping assays. When combined with
genotyping reactions in solution, tags allow flexible assay design. In addition,
their generic characteristic enables the development of universal capture arrays, which simplifies manufacturing. Tag-arrays were initially applied to
SNP genotyping in a ligase assay (Gerry et al. 1999) and further combined
with SNP genotyping by minisequencing (Fan et al. 2000; Hirschhorn et al.
2000). This assay format has been utilized in Studies I-III of this thesis.
Miniaturization
Genetic analysis is becoming increasingly important in clinical genetics and
pharmacogenetics. Genetic variants that are found to be risk factors for inherited diseases will be analysed to predict the susceptibility and onset of disease,
or the treatment response and clinical outcome. In addition, the identification
and scoring of somatic changes that initiate cancers and the identification of
the allelic spectra of human pathogenic microbes, will also be important.
There is great hope that the enhanced understanding of the effect of genetic
variation may be implemented for diagnostic purposes to enable better, more
individual treatment of each patient.
Large-scale genotyping techniques require massive equipment and timeconsuming, labour intensive procedures and analysis. These methods provide
high-throughput analysis with a low price per genotype. Progress in the field
of clinical genetics is however dependent on new assay formats and robust,
yet flexible technology that is fast, sensitive and inexpensive per sample,
when smaller numbers of clinical samples are to be analyzed (Ohno et al.
2008; Weigl et al. 2008). Miniaturization has been an efficient means for enhanced and portable information technology. Also in life-science, micro-total
analysis systems (µ-TAS) or lab-on-a-chip have the capability of integrating
miniaturized components and assay functionalities of multiple reaction steps,
at pL-nL volume scale, into a single microfluidic device. The advantageous
features of these systems are automation, low risk of contamination, low reagent and sample consumption, decreased assay time, increased sensitivity,
and potentially high level of parallelization (Liu and Mathies 2009). Hence,
36
these technologies offer promising tools and analysis solutions for efficient
point-of-care DNA analysis. µ-TAS systems commonly consist of a microfluidic network for biochemical reactions in a device often made of glass or
polymers. In addition, they require a number of control components such as
heaters and temperature sensors for thermocycling, microvalves for separating
functional units, actuators and micropumps for sample transportation and flow
control and magnets for sample manipulation using beads.
The adoption of lab-on-a-chip assays to routine DNA-based diagnostics has
been slow. This is mainly due to the difficulty of integrating multiple functional steps required in genotyping assays, in particular sample treatment and
signal detection (Whitesides 2006). The simple assay formats and short reaction times of single-reaction assays make them attractive. Consequently, the
commercially available biochips use electrophoretic separation and/or homogenous amplification methods for analysing single/low numbers of markers. Two such examples are the chips provided by Agilent Technologies for
high resolution capillary electrophoresis in the Bioanalyzer, and Fluidigm’s
(http://www.fluidigm.com) digital real-time PCR assays for quantification of
DNA sequencing libraries for subsequent MPS (White et al. 2009). Integrated
multiplex genotyping assays often rely on separation of several homogenous
reactions in different channels (Summerer et al. 2010). Genetic tests for complex diseases will require multiplex genotyping of multiple SNPs in multiple
genes in each DNA sample. Ultimately, targeted clinical analysis of
genes/mutations will be replaced by large-scale sequencing of disease gene
pathways and networks. Thus, there is a need for completely automated microfluidic devices with several reaction steps and a solid-phase support for
separation of reagents and analytes during the process. To increase assay
complexity while reducing hardware and system complexity as well as costs
per test, is the true challenge. In Study III we have developed a multiplexed
genotyping assay for cancer diagnostic in a biochip using freeze-dried reagents for simplified reaction integration.
Coding and decoding
Previously, radioactivity and/or gel electrophoresis was the primary detection
method of nucleic acid assays (Syvänen et al. 1993). Today, commonly used
features for labelling are fluorescent molecules (Pastinen et al. 1996), specific
mass of molecules (Karas and Hillenkamp 1988; Kim et al. 2004), the use of
chemiluminescent enzymes (Nyren et al. 1993) and various sequence tags.
The assay format influences the choice of allele label. Since homogenous assays do not include any separation steps, the label must indicate when an allele-specific reaction has occurred. These molecular changes can with favour
be detected using interacting fluorophores and observed by for example fluorescence resonance energy transfer (FRET). To be able to score different al37
leles of multiple products, unique coding of each molecule is necessary. The
integration of a variant label and a locus label can be achieved during the
process of the experiment and during assay design. Parallel read-out of individual signals depends on physical separation of products by for example capillary electrophoresis or on microarrays. Detection methods include mass
spectrometry and microscopes or high-resolution scanners with charged coupled device (CCD) cameras or photomultiplier tubes (PMT).
Data analysis and quality control
Large-scale genetic studies such as genotyping, sequencing, and gene expression analysis generate large amounts of data. In particular the output from
MPS machines is increasing dramatically. This is accompanied by challenges
for massive scale data storage and information technology to ensure proper
data management and interpretation of huge data quantities (Pop and Salzberg
2008; Horner et al. 2010). Computational analyses of complex data sets require (i) management of data including image analysis, signal processing and
background subtraction in an automated manner with minimal levels of manual input and visual inspection (ii) good tools and analysis strategies for base
calling, read alignment and variant identification or allele scoring (iii) automated and objective quality control. Further, downstream and application specific interpretation of data is needed for finding biological relevant answers. In
Study I and II of this thesis, tools for automated allele calling and data quality
control were developed.
38
Aim of this thesis
The overall aim of this thesis was to develop and apply strategies based on
single base primer extension (SBE) to study sequence variation in the fruit fly
and in humans. Specifically the aims were:
•
To improve the performance of the tag-array minisequencing system for SNP genotyping and implement it in novel applications.
•
To use the Genome Analyzer for massively parallel sequencing to
evaluate two hybridization-based methods for targeted sequence
capture.
39
The present study
As outlined in the introductory part, the analysis of sequence variation requires database resources to support assay design, approaches for reducing
genomic complexity to achieve sensitive and specific assays, robust principles
for allele discrimination, multiplex assay formats, and tools for simple data
analysis and assay quality control. The studies included in this thesis address
some of the issues listed above for the discovery and analysis of single nucleotide polymorphisms (SNPs) and contain technical improvements and new
applications of methods based on four-color single base primer extension
(SBE) chemistry.
Tag-array minisequencing
The tag-array minisequencing system for SNP genotyping was used in Studies
I-III of this thesis. It is based on polymerase-assisted minisequencing/SBE for
specific and robust allele discrimination and a tag-microarray format that allows medium to highly multiplex and parallel analysis (Lindroos et al. 2002)
(Figure 6). A major advantage of tag-array minisequencing is its four-color
detection system which enables the interrogation of the four DNA bases, and
hence any SNP variation, in a single reaction. This feature distinguishes tagarray minisequencing from commercially available SBE methods. Thus, the
system is flexible and can be designed for any panel of custom selected SNPs.
It is based on standard freely available equipment and reagents, making it easy
to implement in any regular laboratory. As compared to the large commercial
genotyping platforms, the tag-array minisequencing system is extremely cost
efficient per DNA sample, enabling genotyping of 200 SNPs at a cost of about
1.4€ per sample. The genotyping system has many convenient characteristics
that can be utilized in specialized applications and in novel formats. It is particularly useful for establishing genotyping panels for other organisms than
humans, where common predefined assays are not readily available.
Reaction steps and assay set-up
The tag-array minisequencing system consists of four different biochemical
reaction steps, signal read-out, and data analysis. In addition it involves a
number of preparative steps. Figure 7 outlines the method, and the section
40
below gives a brief overview of the system. Detailed information can be found
in Study IIb and Lindroos et al (Lindroos et al. 2002).
Genotyping requires an initial amplification of the target sequences containing
each SNP by multiplex PCR. As for many other genotyping technologies the
optimization of the PCR reactions is the bottleneck of the system and the PCR
comprises about half of the cost per genotype. By careful optimization and
also by using a “brute force” approach we were able to establish PCR reactions amplifying up to 43 genomic fragments in parallel in Study IIa. Primer
oligonucleotides, preferably amplifying short fragments, were designed with
the software Primer3 or Autoprimer (http://www.autoprimer.com). The remaining dNTPs and primers from the PCR reaction mixture are inactivated by
treatment with Exonuclease I (ExoI) and shrimp alkaline phosphatase (SAP),
prior to genotyping. The exonuclease activity of ExoI degrades the single
stranded primers by cleaving off nucleotides in the 3'- 5' direction and SAP
removes the 5’-phosphate groups of the nucleotides. This enzymatic clean-up
eliminates the need for any washing step or physical separation for product
purification.
Figure 6. (A) The tag-array minisequencing system, based on single base primer extension in solution with fluorescent ddNTPs. (B) The primers contain 5´-tag sequences
for capture of the minisequencing reaction products on generic microarrays carrying
complementary c-tag sequences in an “array-of-arrays” format. The image illustrates
one sub-array/sample showing the hybridization of extension primers for two different
SNPs with one being homozygous and the other being heterozygous in that particular
sample.
41
Multiplex minisequencing reactions are performed in solution for high efficiency and are cycled to increase the number of extended primers. The minisequencing primers are designed to anneal immediately adjacent and upstream
of the SNP site and are extended with fluorescently labelled ddNTPs (Figure
6A). To limit mis-incoporation in the reaction, a specific DNA polymerase
with a 3´-5´ exonuclease proof-reading capability is used. Each minisequencing primer is designed to contain a 5’ tag sequence for hybridization and identification on the microarrays. In Studies I-III, minisequencing primers were
designed for both DNA polarities of each SNP using the SBE primer software
(Kaderali et al. 2003) and the primer-tag combinations were evaluated with
the software Autodimer (Vallone and Butler 2004) to avoid oligonucleotide
sequences forming secondary structures. Tag sequences comprise 20 nt in
length which result in 420 or 1012 possible unique probe combinations that
enable highly multiplex analysis. In the studies included in this thesis between
11 to 119 different clinical mutations, SNPs or InDels were genotyped in parallel in each single reaction.
Figure 7. Procedure for assay set-up and genotyping of SNPs by cyclic minisequencing with capture on Tag-arrays. Required preparative steps are specified on the left
whereas the genotyping procedure is given on the right.
For multiplex signal detection, the SNPs are separated by hybridization to
glass microarrays with immobilized capture oligonucleotides (c-tags). The
extended minisequencing primers are via their tag sequences captured to complementary c-tags with known locations on the arrayed surface. The use of
42
generic tag sequences enables universal, non-SNP specific array designs. In
the studies included in this thesis we use an “array of arrays” format where the
parallel capacity of the microarray has been developed further (Pastinen et al.
2000). It enables multiple SNPs to be interrogated in several samples, analysed in individual sub-arrays on the same microscope slide (Figure 6B). This
format is accomplished by a silicon rubber grid creating separate reaction
chambers. Either 16 or 80 samples can be simultaneously analyzed for up to
~580 or ~200 SNPs respectively, generating as much as ~16,000 genotypes
per slide. The “array of arrays” format is particularly well suited when comparing reactions or assay performance (Lovmar et al. 2003), (Hultin et al.
2005), as in Study I and III, where both versions of the assay format have been
utilized. C-tags to be attached to the microarray slides contained 15 T-residues
as spacers and a NH2-group at their 3'-ends by which they are covalently coupled to CodeLink™ Activated Glass Slides. The microarray slides were
printed with spots of around 150 µm in diameter and a spot-to-spot distance of
200-210 µm.
The microarrays are scanned at four wavelengths to measure the signal intensities of the incorporated fluorescently labelled ddNTPs, and allow deduction
of the genotypes of each SNP. The signals are quantified and genotypes are
assigned based on scatter plots with the logarithm of the sum of both fluorescence signals, log(a+b), plotted against the signal ratio or fraction a/(a+b). The
measured intensity of the incorporated nucleotides is visualized as three genotype clusters. We used the custom software SNPSnapper for genotype assignment. In Study II, automatic allele calling was also performed with the custom
developed SNPmapper software.
Silhouette scores for quality assessment of genotype data
Multiplex genotyping assays require good and automated quality control.
Quality control is performed both on the sample level and by including determined oligonucleotides in the reactions. Moreover, the level of deviation between the observed and the expected distribution of the results and the segregation of alleles are commonly used quality measures. In the tag-array minisequencing system, oligonucleotide probes are included at each step of the reaction protocol for comprehensive quality control (Lovmar et al. 2003). In Study
I of this thesis we addressed quality control of genotype data, and developed a
general tool for automatic evaluation of allele calling and quality assurance of
SNP genotype clusters. The quality measure was applied to evaluate different
DNA polymerases (Study I) and freeze-dried reaction mixtures (Study III) in
the tag-array minisequencing system.
43
Silhouette scores
“Silhouettes” were established as a graphic aid for interpretation and validation of cluster data (Rousseeuw 1987). Silhouette scores provide a measure of
how well a data point is classified when it is allocated to a cluster. This makes
them well suited for assessing the quality of genotype results when genotype
assignment is performed with data based on scatter plots. A Silhouette takes
into account both the tightness of the clusters and the separation between
them. A robust SNP genotyping assay is characterized by large distances between the three genotypes clusters and small distances between the data points
within the same cluster.
Figure 8. Scatter plot used for genotype assignment in the tag-array minisequencing
system illustrating the principle for the Silhouette s(i) calculations for a given data
point (i). The lines indicate the average distance to all points in the same cluster a(i)
and to all points in the other two clusters b1(i) or b2(i).
The Silhouette calculation for one data point in a typical scatter plot obtained
by a SNP genotyping assay is illustrated in Figure 8. For each given data point
(i) in a scatter plot, the Silhouette s(i) is defined as:
where a(i) is the average distance to all data points in the same genotype cluster and b(i) is the average distance to all data points in the closest cluster. Max
and min in the formula denote the largest or smallest of the measures in the
brackets. The average Silhouette values are calculated for each cluster (average Silhouette width) and for each SNP assay (Silhouette score) to generate a
single quality measure for each SNP assay ranging from 1.0 to - 1.0. When the
average distance from a data point to the other data points within the same
cluster is smaller than the average distances to all data points in the closest
44
cluster, a Silhouette value close to 1.0 is obtained. A value close to zero indicates that the data-point could equally well have been assigned to the
neighbouring cluster. A negative Silhouette is obtained when the cluster assignment has been illogical, and the data point is actually closer to the
neighbouring cluster than to the other data points within its own cluster.
We developed the program ClusterA to calculate numeric Silhouettes for clustered data obtained from fluorescence signal ratios. With the program the distance between data points can be measured either in one dimension, for example along the x-axis, or in two dimension using vectors, as shown in Figure 8.
The program also provides the mean, variance and F-statistic for the input
data.
Enzyme performance
Several DNA polymerases have been modified to incorporate fluorescent
ddNTPs and dNTPs at equal efficiency in Sanger sequencing. The ThermoSequenase DNA polymerase was engineered to incorporate ddNTPs with an
approximately two-fold preference over dNTPs (Tabor and Richardson 1995)
and recently additional thermostable enzymes have become available that are
compatible with fluorescent ddNTPs. Since the DNA polymerase is the major
factor for determining the specificity and efficiency in the minisequencing
reaction, a comparison of different DNA polymerases was of interest.
In Study I we compared the three DNA polymerases TERMIPol, Therminator
and Klen-Thermase to the ThermoSequenase enzyme, previously routinely
used in our tag-array minisequencing system as well as in many other laboratories. ClusterA was applied to calculate Silhouette scores on genotype clusters for 26 SNPs acquired when comparing the performance of the four DNA
polymerases with the tag-array minisequencing system. All reactions were
performed at the same reaction conditions with optimized protocols regularly
used for genotyping with the ThermoSequenase enzyme. The "array of arrays"
format (Pastinen et al. 2000) was useful for the comparison and enabled us to
test all enzymes on the same microscope slide. We applied a non-stringent
genotyping approach to be able to detect differences between the enzymes.
The performance of the DNA polymerases was evaluated by comparing genotyping success, fluorescent signals, signal-to-noise ratios, and cluster quality
(Table1) for each enzyme and SNP.
Figure 9 illustrates genotype clusters from three SNP assays that yielded different Silhouette scores in our study. Based on our evaluation we concluded
that SNP assays with Silhouette scores >0.65 may be accepted automatically
while assays with Silhouette scores <0.25 should be failed. Assays with Silhouette scores falling between these values could be accepted or failed after
45
manual inspection. These recommendations are in line with Liu et al., who
included Silhouette calculations in the algorithm used to interpret the data
from the Affymetrix 10K HuSNP hybridization microarray (Liu et al. 2003).
By using 0.65 as cut-off, 70–76% of the SNP assays would have been successful in this study. Due to the non-stringent genotype calling strategy very low
Silhouette scores were obtained for some SNP assays, which normally would
have been failed.
Figure 9. Silhouette scores for genotype clusters from three SNP assays. The results
are from 16 samples genotyped in duplicate. Squares and circles correspond to homozygotes for allele 1 and allele 2 respectively, and triangles are heterozygote samples.
The SNPs are denoted by their dbSNP identification number, and the DNA polarities
analyzed. Silhouette scores are shown in the right hand upper corner of each panel.
Table 1. Enzyme performance
Silhouette Silhouette
score
score
(average) (highest)
Genotype calls1
S/N2
(average)
Correct
Errors
n
%
n
%
n
%
TERMIPol
0.72
20 25.3
2337 98.9
18 0.8
4.3
Therminator
0.69
15 19.0
2323 98.3
32 1.4
3.6
KlenThermase
0.74
22 27.8
2346 99.3
10 0.4
8.0
ThermoSequenase 0.71
22 27.8
2324 98.3
34 1.4
8.9
1
Number of genotype calls (n) and call rate (%).
2
S/N were calculated by dividing fluorescence intensity values from nucleotides corresponding
to a true genotype (signal) by the fluorescent intensity value from the remaining ddNTPs
(noise). Both signals and noise were measured from the same spot.
All four enzymes performed satisfactory in our minisequencing assays. KlenThermase showed the highest average Silhouette score, the tightest distribution of Silhouette scores and generated more correct genotypes than the others
(Table 1). In Figure 10 below, the average fluorescent signal intensities are
displayed with respect to SNP genotype and nucleotide, illustrating the true
and incorrect signals generated by each enzyme. The signal intensity levels
were comparable apart from for the T-signals which were lower for two of the
46
enzymes. In general the levels of correct signals were high for KlenThermase
with little mis-incorporation. In conclusion, KlenThermase showed the over
all best results in our study at a competitive price.
Figure 10. Average fluorescent signal intensities with respect to SNP genotype. Intensity levels are plotted on the y-axis for each ddNTP type and SNP genotype. Signals
for homozygotes (AA, CC, GG and TT), heterozygotes (AX, CX, GX and TX) and
from mis-incorporated ddNTPs of each nucleotide type (XX) are given for each DNA
polymerase.
Tools for gene mapping in Drosophila melanogaster
For efficient gene mapping in the model organism Drosophila melanogaster a
map position needs to be refined from a whole chromosome or chromosome
arm to about 50 kb, which is the practical limit for recombination mapping.
The resolution of available SNP maps in Drosophila were until recently insufficient for fine-mapping and the evolving sequencing technologies still too
costly for large screens. Therefore, we developed a dense SNP map and a set
of tools in Study II, to accelerate large-scale gene mapping. All laboratory
steps and the protocols are thoroughly described in Study IIb.
47
SNP map and FLYSNPdb
A high-resolution SNP and InDel map was established and organized in the
publically available FLYSNPdb. A detailed description of the SNP map and
database creation can be found in Publication IIa and Chen et al (Chen et al.
2009). In short it included amplification and sequencing of fragments in nonrepetitive, non-coding regions equally distributed along each major chromosome arm of Drosophila (X, 2L, 2R, 3L, and 3R). Three to six divergent Drosophila strains for each chromosome were chosen according to common usage
in genetic screens and suitability for mapping purposes. We identified a total
of 109,369 SNPs (2,244 amplifiable SNP marker regions). The average distance between SNP markers and thus, the resolution of the map is 50.2 kb,
which on average comprises 6.3 genes. Marker and strain information are
found in the FLYSNPdb via the FlySNP homepage (http://flysnp.imp.ac.at).
Genome-wide genotyping assays for gene mapping
To establish a resource for fast gene mapping in Drosophila, we used the information from the high density SNP map to design a panel of 295 genotyping
assays for SNPs and InDels, equally distributed across the Drosophila genome, with an average spacing of 390.5 kb. Initially, a larger number of genotyping assays were designed and the most optimal and best performing assays
were included in a final set. Two fly samples with divergent genotypes were
used for each chromosome. Strain and assay information for the final set is
included in Table 2 below.
Table 2. Genotyping assays
Chr Fly strains
X
2
3
CSisoX (Bloomington #6364)
Bloomington #98
CSiso2/CyO
ORiso2/CyO
CSiso3 (Bloomington #6366)
Bloomington #576
SNP assays
58
2L = 59
2R = 60
3L = 58
3R = 60
58
Multiplex
PCR
reactions
5
Fragments
per reaction
13
5-14
11
3-16
5-23
Both the PCR reactions as well as the genotyping assays were designed in a
highly multiplex fashion using a “brute-force” approach. For this purpose we
fully utilized the flexibility and the two versions of the “array-of-arrays” format. Initially the genotyping assays were set up in the format which enables
genotyping of a larger number of SNPs and subsequently the format was redesigned for genotyping 80 recombinants in the final set of SNPs on each
microarray slide. One microarray slide for chromosomes X, 2, and 3, respec48
tively, were established for discriminating between the selected stocks (Figure
11). If using other strains for mapping, the established microarrays still are
potentially useful, with a varying loss of informative SNP assays. Moreover,
new minisequencing assays can easily be established from the SNP map, if a
different SNP set needs to be tested.
Figure 11. Scanning images for one sub-array from each of the three established microarrays. The visualized sub-arrays represent the samples CSisoX, CSiso2 and
CSiso3 respectively. Each sub-array is scanned at four different wavelengths corresponding to the fluorescent label of each nucleotide, indicated in the upper left corner
of each image row.
SNPmapper for simplified data analysis
The combining of the microarray data with the phenotypic information is not
trivial. We therefore designed a software tool called SNPmapper to automate
the interpretation of the genotype data and the transformation into a genetic
map position. SNPmapper is a platform-independent, freely available software
for allele-calling and for correlating and displaying genotypes and phenotypic
49
information in a user-friendly and colour-coded graphical output. The recombinant chromosomes are sorted by phenotype and breakpoint, illustrating all
recombination events. Genotype calling is achieved by applying a clustering
algorithm on the fluorescence intensity values. Custom options for background corrections and exclusion of poorly performing samples are also available.
Resources for fine-mapping
The mapping resolution of the minisequencing method was 3 Mb on average.
The identification of the gene of interest thus requires further fine-mapping.
Fine mapping can be achieved by (i) comparison to or complementation with
known mutants (ii) using chromosomal deficiencies or P-element insertion
mutants (iii) low scale SNP genotyping (iV) RNAi on genes in mapped region. Two different tools for fine mapping were established in Study II. We
generated fly stocks containing two closely-linked EP elements at defined
genomic intervals. The recombination events in the segment between the EP
insertions are detectable by eye-colour. Three to 16 EP lines are available for
the different chromosome arms, which makes it possible to narrow down the
mutated region to a 10 - 20 kb interval. A set of 1400 additional genotyping
assays were also pre-computed for local genotyping of the 2EP intervals.
Mapping of genes underlying muscle development
To verify the potential of the resources, we mapped mutants from a screen for
genes required for Drosophila embryonic muscle morphogenesis. To screen
for genes on chromosome 2, ORiso2 lines were mutated and muscle defects
were observed. Totally 140 muscle mutant lines were recovered. About 100
recombinants per mutant were created for mapping with the minisequencing
system by crossing the mutagenized stocks with CSiso2. Totally, 1713 samples were genotyped in the screen and more than 160 000 genotypes were
generated. Of these, 549 samples were typed in the most dense assay panel at
the time, comprising 115 SNPs on chromosome 2. The success rate for these
assays was 90.56 %, and the assay accuracy for 4-5 control samples was
99.67%. The average final resolution of the mapping method was 2.8 Mb,
which is determined both by the density of successful genotyping and by the
number of recombination events. The gene Sticks and Stones, known to play a
role in muscle morphogenesis, was used as a positive control for the screen.
The gene was mapped correctly to a region of 2.3 Mb which demonstrated the
robustness of the presented SNP mapping tools. By mapping 14 genes in less
than four months, the practical feasibility of our system was shown.
50
Tag-array minisequencing in a lab-on-a-chip
In Study III we established genotyping assays with the tag-array minisequencing system to be implemented in a lab-on-a-chip assay. Assays for multiplex
DNA analysis are important for point-of-care testing in the doctor’s office and
for other types of on-site applications. As a relevant model for a diagnostic
chip the prototype panel was developed for five mutations in the gene encoding the tumour suppressor protein (TP53), which are the most frequently mutated sites in common cancers (Petitjean et al. 2007; Petitjean et al. 2007).
Additionally we included 13 common SNPs in the same gene for evaluation of
the assay performance. Reaction protocols were established and optimized for
each step of the genotyping procedure. The multiplex PCR reaction amplified
13 fragments containing the 18 markers, and genotyping was performed by an
18-plex minisequencing reaction. One obstacle for complete integration is
liquid handling and external reagent supply. To enable integration into a microfluidic chip, we therefore sought to reduce the hardware complexity by
storing all reagents needed in the genotyping protocol on the chip during fabrication. Since the tag-array minisequencing protocol is based on four consecutive reaction steps with no washing or separation of reagents in between,
the storage of reagents on-chip abolishes the need for external reagent supplies.
Freeze-drying of reagents
One approach for simplifying on-chip storage is freeze-drying or lyophilisation of the reaction mixtures. Lyophilisation is a dehydration process which
involves initial freezing of a substance followed by a reduction of the aqueous
content at low pressure. Freeze-drying has been widely used by the food,
pharma, and biotech industries to preserve and extend the shelf life of a product, and to enhance transportation due to reduced weight. The procedure has
further been utilized to stabilize reagents for biochemical assays, mainly “master mixtures”, consisting of buffers, nucleotides and primers. Thermostable
DNA polymerases have however also been freeze-died (Klatser et al. 1998),
and applied for amplification-based pathogen detection (Aziah et al. 2007;
Aitichou et al. 2008) or for genotyping in miniaturized assays (Focke et al.
2010).
We freeze-dried reagents for the three enzymatic reaction steps of the minisequencing method; PCR, PCR-cleanup and minisequencing, as well as the hybridization buffer. These reactions include two different types of thermostable
DNA polymerases used in the PCR and minisequencing reaction, and the
thermo-sensitive enzymes Exonuclease I (ExoI) and shrimp alkaline phosphatase (SAP) for PCR purification. During freeze-drying the molecules are
exposed to stress which may lead to reduced enzyme activity when rehy51
drated, mainly because of altered molecule structure. The addition of stabilizing substances, lyoprotectants, may enhance the enzyme stability and the recovery rate. Lyoprotectants might however also affect the overall reaction
performance by themselves inhibiting the active molecule. We performed
combinatorial real-time PCR experiments to investigate the activity of the
DNA polymerase when freeze-dried both without and in the presence of a
combination of the four lyoprotectants; trehalose, polyethylene glycol (PEG),
mannitol and dextran, at different concentrations. The enzyme activity was
high for all concentrations of lyoprotectants tested and no inhibitory or cooperative effects from lyoprotectants were observed. Although high performance
was expected for the robust DNA polymerase, confirmed activity for the special SmartTaq enzyme suitable for multiplex PCR, was important. All reactions that included PEG resulted in a lower PCR yield and consequently we
removed it from all mixtures.
The stability of the active constituent over time is critical. When performing a
long-term stability test on the DNA polymerase, we obtained nice PCR products using dry mixtures stored up to six month. This result was attained even
when low concentrations of lyoprotectants were included in the mixture. The
novel aspect of our study is the freeze-drying of the thermo-sensitive enzymes
ExoI and SAP, which are expected to be more fragile and sensitive to the
freeze-drying procedure. We performed accelerated stability tests on these
enzymes, which enabled a prediction of the shelf life in a shorter time span. In
our experiment freeze-dried samples of ExoI and SAP were incubated at elevated temperatures for five days, and the residual enzymatic activities were
measured with two fluorescence assays. These experiments revealed a clear
stability dependence both on storage temperature and lyoprotectant concentration. The experimental data was used to predict the half life of the enzymes to
50 days for ExoI and 200 days for SAP stored at-21ºC. Our results are the first
to show the feasibility of freeze-drying ExoI and SAP. In the establishment of
optimal reaction protocols for the integrated assay we also considered the effect of lyoprotectant accumulation in consecutive reaction chambers.
Systematic genotyping comparison
We next evaluated the freeze-dried reagent mixtures in the complete genotyping protocol. We performed a systematic genotyping comparison where reagents in each of the reaction steps were replaced with dry reagent mixtures in
a stepwise manner. Twelve different reagent combinations were investigated,
using one cell-line DNA sample (Figure 12A), and analysed on the same microarray to minimize technical variation. Highly identical results are illustrated by the scan images for combination 1 and 12 using either liquid or
freeze-dried reagents in each reaction step (Figure 12B). These results are
persistent also through quantitative evaluation of the signal intensities and
52
background noise (Figure 11C) as well as signal to noise ratios (S/N) (Figure
11D). The experiment did not indicate a particular freeze-dried reagent that is
critical for the success of the assay.
Figure 12. (A) Combinations of freeze-dried and liquid reaction mixtures tested. Fluorescent scan images at four wavelengths for mixture combination 1 and 12, and signal
and background levels (C) or S/N ratios (D) for all SNPs in each of the 12 reagent
combinations.
We genotyped DNA samples from ten blood donors using either freeze-dried
or liquid reagents in each reaction step in two replicate experiments (LIQ1,
FD1 and LIQ2, FD2, respectively). The median S/N values were slightly
lower using freeze-dried reagents than with liquid reagents (Figure 13A). The
concordant results for the replicate experiments indicated a high reproducibility of the genotyping process using freeze-dried reagents. All different genotype classes were obtained, which enabled us to determine the genotypes for
theses samples and to evaluate the cluster quality. The cluster quality of successful assays was assessed by calculating numeric Silhouette scores, described in detail on page 43 of this thesis and in Publication I. Median Silhouette scores of 0.72 were obtained in the experiments performed using liquid
reagents, but acceptable scores (median = 0.50) were also achieved in the experiments with freeze-dried reagents (Figure 13B). An average of 175 genotypes out of totally 180 (97.2%) were successfully called in the two experiments employing liquid reagents and in the experiments with freeze-dried
53
reagents the corresponding number was 144 (80%). The genotype concordance was high in our experiments. Of the successfully called genotypes only
three discrepant results were observed, all using freeze-dried reagents.
Figure 13. (A) S/N ratios and (B) Silhouette scores for all SNP assays genotyped in
ten blood donors. Two replicate experiments were performed using either liquid (Liq1
and Liq2) or freeze-dried reagents (FD1 and FD2).
Genotyping in micro-fabricated structures
On the way towards full integration, we transferred the biochemical reactions
of the multiplex genotyping protocol to micro-fabricated test chambers. All
reaction mixtures were freeze-dried in dedicated reaction chambers. The 18
SNPs and mutation sites were genotyped in triplicate reactions using a cellline DNA sample at the same conditions as for the procedure in test tubes. The
reaction products were manually transferred between chambers for each reaction and the extended minisequencing primers were hybridized to a conventional glass microarray for signal read-out. Reaction chambers were milled on
a cyclic olefin copolymer (COC) substrate (Figure 14A). A special chamber
sealing solution was developed that allowed thermocycling of the chambers at
high temperatures as well as easy product recovery. Thermocycling and incubation was achieved by a custom-made temperature actuator with a Peltier
element for active heating and cooling, on which the reaction chamber was
fixed. The design and fabrication of the hardware components is detailed in
Publication III. The multiplex PCR product generated in the first test chamber
was compared to a PCR product obtained using freeze-dried reagents in a test
tube, and found to be comparable (Figure 14B). As shown in figure 14C, the
genotyping products generated in consecutive chambers were a proof-ofprinciple for successful transfer of the multiplex genotyping reactions, using
freeze-dried reagents, to the microchip format. This result also confirmed that
the freeze-dried reagents are compatible with the polymer material, commonly
54
used for the fabrication of microfluidic devices. Through the European
SMART-BioMEMS project the goal is to implement the deposition of freezedried reagents in a fully integrated lab-on-a-chip system for diagnostic genotyping including sample pre-treatment and fluorescence detection (Figure 14D
and 14E).
Figure 14 (A) Reaction chambers milled in a COC substrate to generate single chamber chips. (B) Electropherogram showing multiplex PCR products generated with
freeze-dried reagents in COC chambers (left) and test tubes (right). (C) Fluorescent
scan images for triplicate genotype assays performed in COC chamber and scanned at
two wavelength for Tamra-ddCTP (top) and Cyanine5-ddUTP (bottom). (D) Integrated COC chip containing reaction chambers for all biochemical reactions including
sample treatment and array hybridization. (E) Reaction control system. The devices
shown in panel D and E were developed in the SMART-BioMEMS project (project
no: IST-016554).
55
Sample complexity reduction for re-sequencing candidate
genes
The Genome Analyzer system from Illumina is based on the same SBE reaction principle as used in the minisequencing system. The four-colour chemistry has been developed further to allow massively parallel sequencing (MPS)
of short DNA fragments. The system uses an engineered high-fidelity enzyme
for enhanced read accuracy, and sophisticated reversible terminating fluorescent nucleotides. In Study IV, we utilized the sequencing system to evaluate
two methods for targeted MPS of candidate genes and variant discovery;
Nimblegen capture arrays from Roche and the SureSelect capture system from
Agilent. The systems are described in more detail on page 23 of this thesis and
in Figure 3. The goal of the study was to identify the optimal enrichment
method for continuous, custom-selected regions for sequencing using the Genome AnalyzerIIx. Both systems are commercially available to be used by any
researcher. The assay design is included in the service and only requires the
coordinates for the desired target regions from the customer. The methods are
both based on target selection by hybridization conducted either in situ on a
microarray, or in solution. Our comparison experimentally evaluated the complexity reduction in relation to probe design, successful target capture and
correct variant calling.
Study design
We aimed at targeting the sequences of 56 genes, including exons, introns and
5 kb of non-coding sequence flanking each gene, which made up a total target
region of 3.1 Mb. The SureSelect capture probes were designed to tile the
target region with 30 bp offset while the tiling was denser (five bp offset) with
the Nimblegen probes. A difference in the design of the two probe sets is the
way repetitive sequences are defined. The SureSelect method uses the repeatmasking algorithm to exclude defined sequence motifs with low complexity.
The Niblegen method on the other hand, uses the occurrence of each possible
15mer in the probe sequence as a measure of complexity, and discards everything above a certain threshold. A custom option for repeat-masking does exist
for the SureSelect method. The difference in probe design was reflected in our
experiments by the difference in size of the probe covered region but also by
lower capture success of low complexity regions. We prepared standard sequencing libraries from cell-line DNA of three related HapMap individuals.
Library molecules were hybridized either to oligonucleotide arrays or to
probes in solution that were captured to beads, eluted and amplified by universal PCR. All samples were sequenced on a single lane of an Illumina flow cell
in the same 36 cycle GA run. Image analysis, base calling, read alignment to
the human reference sequence and SNP calling was performed with the Illumina analysis pipeline and with open source programs and scripts.
56
Capture quality
The capture specificity or the portion that aligns to the target region defines
the over-sampling required for sufficient coverage of a sequencing library.
The average specificity was higher for the SureSelect captured libraries compared to the Nimblegen libraries (Table 3), and thus approximately 1.4-fold
more over-sampling is required when using the array method. More consistent
results between the SureSelect libraries also indicated a better reproducibility.
Table 3. Average performance
Probe cov- Specificity Sensitivity1
ered region (%)
(%)
(Mb)
Nimblegen
SureSelect
1
2
2.4 (78%)
1.7 (55%)
54
74
98.5 (76)
99.8 (55)
Seq. Coverage2 Norm.
depth >10x (%) coverage
(%)
74
91 (93)
68
157 98 (98)
68
Probe covered region, total target region in parenthesis
Probe covered region, region covered by both probe sets in parenthesis
We also determined the method sensitivity or the completeness of coverage.
The sensitivity is defined as the fraction of the target region which is captured
and covered at least once (Table 3). When the probe targeted regions were
considered, the SureSelect libraries showed a higher sensitivity. Nimblegen
libraries however showed very similar levels of completeness for the region
targeted by both probe sets, and covered more of the original target region.
The mean sequencing depth was notably higher for the SureSelect libraries
(Table 3) and the uniformity of coverage was slightly higher. The normalized
coverage, which is independent of the sequencing depth, was however similar
for both methods and even higher for the Nimblegen libraries. Of the targeted
bases 67-69% had at least half of the average coverage in the SureSelect libraries and for the Nimblegen libraries the same number was 61-72%.
Variant calling
SNPs were called by comparison to the human reference sequence. When
requiring a minimum coverage of ten reads, we were able to call between
2536 and 3371 SNPs per sample. Due to the larger probe-covered region, the
number of called SNPs was higher in the libraries that were prepared using the
Nimblegen method than in the ones prepared with the SureSelect method. The
opposite was seen when SNPs in the region targeted by both methods was
taken into account, probably due to higher sequencing coverage. We compared the called variants in our sequencing data to the genotypes from HapMap SNPs, annotated SNPs in the Ensembl variation database, and variants in
57
the 1000 genomes project pilot I data release (April 2009), in the target region.
The called HapMap SNPs in the probe specific regions and in the region covered by both methods followed the same pattern described above for the total
number of identified variants. In total 69-181 novel SNPs were identified per
sample of which two to 15 were annotated to exons. Two of the sequenced
samples are part of the 1000 genomes cohort which explained the low number
of novel SNPs found in our study. Our identification of additional novel variation however, emphasizes the need for high coverage sequencing, to disclose
all sequence variation within a region of interest.
For a number of SNP calls the position of a SNP agreed, but the base disagreed with the HapMap SNP data. Eleven positions that disagreed also after
manual inspection were analysed with Sanger sequencing. In ten cases, our
GAIIx sequencing calls were validated, while in one case the Sanger sequencing result agreed with the HapMap data. A set of novel SNPs were also Sanger
sequenced, which confirmed our SNP calls in eight cases, while two positions
had been called erroneously in the Nimblegen libraries. We evaluated the inheritance pattern of novel SNPs and used the errors to estimate the false SNP
discovery rate to 0.5% for the SureSelect library and 0.7% for the Nimblegen
library for the sample NA10860. The combined results in Study IV demonstrates the feasibility to use hybridization-based sequence capture and sequencing by GAIIx, to reliably detect known and discover new SNPs in the
targeted region.
58
Concluding remarks
Single base primer extension (SBE) is the most widely applied reaction principle in genotyping and sequencing systems. It is the basis of both the tagarray minisequencing system and the Genome Analyzer, which have been
utilized in the studies included in this thesis. Tag-array minisequencing provides robust genotyping at intermediate multiplexing levels and sample
throughput. We have shown its flexibility in Studies I-III, by employing and
adapting it to novel applications.
The software tool which we developed for calculating Silhouette scores in
Study I, can with favour be used for objective validation of the quality of SNP
genotyping assays. The calculation of Silhouette scores can be implemented in
large-scale genotyping systems where automated quality scoring is desirable.
A similar algorithm is indeed used for data from Affymetrix, and in the Illumina Infinium assays. Moreover, Silhouette scores are convenient to utilize
during assay development and optimization. As in Study I and III, the measure
can be applied to compare assay performance at different reaction conditions,
or the results attained with different SNP genotyping technologies. In our research group we have used Silhouette scores to evaluate different methods for
whole genome amplification (WGA), to be used as templates in the tag-array
minisequencing system (Lovmar et al. 2003). The program that we created for
calculating Silhouette scores is freely available. It can be used for quality assessment of the results from all genotyping systems, where the genotypes are
assigned by cluster analysis using scatter plots. The same algorithm has been
implemented as a function in the cluster package in the R programming environment (http://www.R-project.org) (R Development Core Team 2010), for
large-scale data analysis.
With the complete set of mapping resources that we developed in Study II, we
significantly reduced the time needed for large-scale gene mapping in Drosophila. The map of high-density sequence polymorphisms improved the resolution, and increased the speed of genome-wide mapping and cloning of genes.
Further, our tag-array minisequencing system in the special “array of arrays”
format is particularly well-suited for parallel genotyping of an intermediate
number of samples and SNPs in a cost-effective, open source manner. The
creation of the automatic allele calling software makes it easy to identify informative recombinants, which usually is the rate-limiting step in a mapping
59
project, and to quickly extract the relevant recombinants from a large number
of samples. The assays have been utilized to identify the transmembrane protein, Kon-tiki (Kon), which mediates myotube target recognition in the Drosophila embryo during muscle development (Schnorrer et al. 2007).
Our successful transfer of tag-array minisequencing assays to microfabricated
structures in Study III, shows the potential to develop a completely integrated
lab-on-a-chip for multiplex genotyping of cancer mutations. We found that
freeze-drying of reagents and storage on the biochip, is a good means for reducing the complexity of the multi-reaction DNA test. It is the first time that
all reagents needed in the genotyping reactions, including the thermo-sensitive
enzymes, have been freeze-dried and applied in a miniaturized assay. This
achievement is critical also for other enzyme-assisted genotyping assays. The
tag-array minisequencing system consists of sequential reactions without any
separation steps, which reduces the complexity of the control-hardware
needed. Miniaturized assays have convenient characteristics for simple on-site
DNA tests in healthcare, at crime scenes, at farms, or in environmental field
studies. Biochips containing freeze-dried reagents are robust and promising
for use outside of controlled lab environments. Important applications therefore include point-of-care testing in developing countries with limited infrastructure.
The Genome Analyzer system offers large-scale sequencing of short DNA
fragments. The system can be used for whole genome sequencing or resequencing of regions of interest, as in Study IV of this thesis. Target enrichment techniques are powerful but increase the cost for sample preparation and
experiment time. With increased sequencing throughput, reduced cost per
sequenced base, and simplified data analysis, whole-genome sequencing will
eventually overtake targeted sequencing for the analysis of individual samples.
The analysis of sub-parts of the genome will however remain relevant for
genotype analysis of candidate regions in large sample sets, and for clinical
sequencing. The two capture methods which we evaluated in Study IV, both
detected known SNPs, and discovered novel SNPs in continuous target regions. Our comparison demonstrated the feasibility for complexity reduction
of sequencing libraries by hybridization-based capture methods. The strengths
of the different methods should be taken into account in future development of
enhanced methods for complexity reduction. As DNA sequences with lower
complexity and variants within repetitive stretches can be biologically relevant, it will be essential to analyse all the variation in a given region. We
therefore conclude that probe design that enables better coverage across entire
target regions, will be increasingly important.
60
Massively parallel sequencing, often based on sequencing-by-synthesis, will
without doubt have a revolutionizing impact on genomics and genetics, and
fundamentally change our understanding of model organisms, and ultimately
of ourselves. These technologies are promising for comprehensive scanning
and screening of all types of genetic variation, or DNA modifications, and will
most probably substitute many techniques for genetic analysis used today,
including genotyping techniques. Such variant analysis enables further characterisation of common complex disease and normal variation, and the identification of rare genetic variants and their affects on the phenotypic level in different populations or individuals. The same techniques will also be important
for clinical genetics and facilitate more “personalized medicine”, and further
transform population genetics, microbiology, evolutionary genetics and the
characterization of ecological diversity (metagenomics). The sequencing race
continues! The X Prize Foundation (http://www.xprize.org) promotes the sequencing of 100 human genomes within ten days, or less at high accuracy, and
at 10,000USD per genome. The “third generation” sequencers are already
advancing. Important features are the direct analysis of single molecules in the
full complexity of a whole human genome. Kilobase pair long reads simplify
assembly, and make it possible to interrogate repetitive regions, and to reveal
haplotype information. The sensitive reactions allow direct detection of base
modifications during the sequencing process, and in situ sequencing in tissues.
Both basic research and translational efforts are required for biomedical progress. Therefore, I have enjoyed working with the development of new applications and tools, which have the potential to be used and adapted by many
researchers. The tremendous development in molecular technology has made
this a very exciting time for performing my research studies.
61
Acknowledgements
The work in this thesis was performed in the group for Molecular Medicine at
the Department of Medical Sciences, Uppsala University. It was possible
through financial support from the Swedish Research Council for Science and
Technology (VR-NT) and by grants from the EU Sixth Framework Program
(FLYSNP and SMART-BioMEMS projects).
There are many people who have been involved in and contributed to this
thesis and to whom I would like to express my gratitude. Some of you I would
especially like to mention here.
My supervisor professor Ann-Christine Syvänen. For giving me the opportunity to be part of your lab and for supporting my Boston intermezzo. For all
your encouragement and you solid biochemical knowledge. My second supervisor Mats Nilsson. For your interesting work in the field of molecular technology development. It has inspired me.
This thesis is a result of many collaborations, partly due to our involvement in
the EU- projects. I would like to thank the partners in the lab of Barry Dickson
in Vienna; Doris Chen, Frank Schnorrer and Irene Kalchhauser, as well as the
team from Hungary. Further I would like to thank the whole group involved in
the SMART-BioMEMS project and especially Monica Brivio, Anders Wolff,
Henrik Brus and Massimo Romani. Due to these projects I have learned a lot
beyond SNP genotyping.
The Church lab in Boston for the warm welcoming and the inspiring environment during my research exchange.
All my research colleagues and co-workers.
Previous members of the Molecular Medicine group. Lotta Rönn, Andreas
Dahlgren and especially Lovisa Lovmar, Mats Jonsson, Lili Milani, Cecilia
Oksanen and Snaevar Sigurdsson for the collaborations on the publications
included in this thesis. A special thank you goes to Raul Figueroa for endless
microarray spotting and practical help in the lab. Everyone at the SNP genotyping and sequencing platform as well as Lillebil Andersson. For the help
62
and the positive attitude towards everything. This goes also for all people at
the Department of Medical Sciences. You really brighten up the facilities.
The current members in our research group. Anna Kiialainen, Olof Karlberg
and Anders Lundmark for the collaboration on the publications. My fellow
PhD students Per Lundmark, Chuan Wang and Jessica Nordlund as well as
Sofia Nordman and Eva Berglund. You all contribute to the nice atmosphere
in our group and to the great discussions and the good times we have had.
Johanna Sandling for you help and for being a terrific colleague and friend
throughout our joint PhD-student journey.
Last but not least I want to thank all my dear friends and my dearest family.
For everything. All you interest over the years and you’re never ending support, also during the excellent thesis retreat.
63
References
Adams, M. D., S. E. Celniker, et al. The genome sequence of Drosophila
melanogaster. Science 287(5461): 2185-2195 (2000).
Ahmadian, A., B. Gharizadeh, et al. Genotyping by apyrase-mediated allele-specific
extension. Nucleic Acids Res 29(24): E121 (2001).
Aitichou, M., S. Saleh, et al. Dual-probe real-time PCR assay for detection of variola
or other orthopoxviruses with dried reagents. J Virol Methods 153(2): 190-195
(2008).
Albert, T. J., M. N. Molla, et al. Direct selection of human genomic loci by microarray
hybridization. Nat Methods 4(11): 903-905 (2007).
Alves, A. M. and F. J. Carr. Dot blot detection of point mutations with adjacently
hybridising synthetic oligonucleotide probes. Nucleic Acids Res 16(17): 8723
(1988).
Ashburner, M. and C. M. Bergman. Drosophila melanogaster: a case study of a model
genomic sequence and its consequences. Genome Res 15(12): 1661-1667 (2005).
Astbury, W. Nucleic acid. Symposia of the Society for Experimental Biology 1(66)
(1947).
Astier, Y., O. Braha, et al. Toward single molecule DNA sequencing: direct
identification of ribonucleoside and deoxyribonucleoside 5'-monophosphates by
using an engineered protein nanopore equipped with a molecular adapter. J Am
Chem Soc 128(5): 1705-1710 (2006).
Avery, O. T., C. M. Macleod, et al. Studies on the Chemical Nature of the Substance
Inducing Transformation of Pneumococcal Types : Induction of Transformation
by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type Iii. J
Exp Med 79(2): 137-158 (1944).
Aziah, I., M. Ravichandran, et al. Amplification of ST50 gene using dry-reagentbased polymerase chain reaction for the detection of Salmonella typhi. Diagn
Microbiol Infect Dis 59(4): 373-377 (2007).
Ball, M. P., J. B. Li, et al. Targeted and genome-scale strategies reveal gene-body
methylation signatures in human cells. Nat Biotechnol 27(4): 361-368 (2009).
Baner, J., A. Isaksson, et al. Parallel gene analysis with allele-specific padlock probes
and tag microarrays. Nucleic Acids Res 31(17): e103 (2003).
Barrett, J. C., S. Hansoul, et al. Genome-wide association defines more than 30
distinct susceptibility loci for Crohn's disease. Nat Genet 40(8): 955-962 (2008).
Bell, P. A., S. Chaturvedi, et al. SNPstream UHT: ultra-high throughput SNP
genotyping for pharmacogenomics and drug discovery. Biotechniques Suppl: 7072, 74, 76-77 (2002).
Bennett, S. T. Solexa Ltd. Pharmacogenomics 5: 433-438 (2004).
Bennett, S. T., C. Barnes, et al. Toward the 1,000 dollars human genome.
Pharmacogenomics 6(4): 373-382 (2005).
Bentley, D. R., S. Balasubramanian, et al. Accurate whole human genome sequencing
using reversible terminator chemistry. Nature 456(7218): 53-59 (2008).
64
Berger, J., T. Suzuki, et al. Genetic mapping with SNP markers in Drosophila. Nat
Genet 29(4): 475-481 (2001).
Bodmer, W. and C. Bonilla. Common and rare variants in multifactorial susceptibility
to common diseases. Nat Genet 40(6): 695-701 (2008).
Braslavsky, I., B. Hebert, et al. Sequence information can be obtained from single
DNA molecules. Proc Natl Acad Sci U S A 100(7): 3960-3964 (2003).
Buetow, K. H., M. Edmonson, et al. High-throughput development and
characterization of a genomewide collection of gene-based single nucleotide
polymorphism markers by chip-based matrix-assisted laser desorption/ionization
time-of-flight mass spectrometry. Proc Natl Acad Sci U S A 98(2): 581-584
(2001).
Burdett, H. and M. van den Heuvel. Fruits and flies: a genomics perspective of an
invertebrate model organism. Brief Funct Genomic Proteomic 3(3): 257-266
(2004).
Celniker, S. E. and G. M. Rubin. The Drosophila melanogaster genome. Annu Rev
Genomics Hum Genet 4: 89-117 (2003).
Chee, M., R. Yang, et al. Accessing genetic information with high-density DNA
arrays. Science 274(5287): 610-614 (1996).
Chen, D., A. Ahlford, et al. High-resolution, high-throughput SNP mapping in
Drosophila melanogaster. Nat Methods 5(4): 323-329 (2008).
Chen, D., J. Berger, et al. FLYSNPdb: a high-density SNP database of Drosophila
melanogaster. Nucleic Acids Res 37(Database issue): D567-570 (2009).
Choi, M., U. I. Scholl, et al. Genetic diagnosis by whole exome capture and massively
parallel DNA sequencing. Proc Natl Acad Sci U S A 106(45): 19096-19101
(2009).
Cirulli, E. T. and D. B. Goldstein. Uncovering the roles of rare variants in common
disease through whole-genome sequencing. Nat Rev Genet 11(6): 415-425 (2010).
Cohen, S. N., A. C. Chang, et al. Construction of biologically functional bacterial
plasmids in vitro. Proc Natl Acad Sci U S A 70(11): 3240-3244 (1973).
Conrad, D. F., D. Pinto, et al. Origins and functional impact of copy number variation
in the human genome. Nature 464(7289): 704-712 (2010).
Craddock, N., M. E. Hurles, et al. Genome-wide association study of CNVs in 16,000
cases of eight common diseases and 3,000 shared controls. Nature 464(7289):
713-720 (2010).
Crick, F. Central dogma of molecular biology. Nature 227(5258): 561-563 (1970).
Dahl, F., J. Baner, et al. Circle-to-circle amplification for precise and sensitive DNA
analysis. Proc Natl Acad Sci U S A 101(13): 4548-4553 (2004).
Dahl, F., M. Gullberg, et al. Multiplex amplification enabled by selective
circularization of large sets of genomic DNA fragments. Nucleic Acids Res 33(8):
e71 (2005).
Dahl, F., J. Stenberg, et al. Multigene amplification and massively parallel sequencing
for cancer mutation discovery. Proc Natl Acad Sci U S A 104(22): 9387-9392
(2007).
Dean, F. B., S. Hosono, et al. Comprehensive human genome amplification using
multiple displacement amplification. Proc Natl Acad Sci U S A 99(8): 5261-5266
(2002).
Dressman, D., H. Yan, et al. Transforming single DNA molecules into fluorescent
magnetic particles for detection and enumeration of genetic variations. Proc Natl
Acad Sci U S A 100(15): 8817-8822 (2003).
Drmanac, R., A. B. Sparks, et al. Human genome sequencing using unchained base
reads on self-assembling DNA nanoarrays. Science 327(5961): 78-81 (2010).
65
Drossman, H., J. A. Luckey, et al. High-speed separations of DNA sequencing
reactions by capillary electrophoresis. Anal Chem 62(9): 900-903 (1990).
Dufva, M., J. Petersen, et al. Detection of mutations using microarrays of poly(C)10poly(T)10 modified DNA probes immobilized on agarose films. Anal Biochem
352(2): 188-197 (2006).
Fan, J. B., X. Chen, et al. Parallel genotyping of human SNPs using generic highdensity oligonucleotide tag arrays. Genome Res 10(6): 853-860 (2000).
Fan, J. B., A. Oliphant, et al. Highly parallel SNP genotyping. Cold Spring Harb Symp
Quant Biol 68: 69-78 (2003).
Fire, A., S. Xu, et al. Potent and specific genetic interference by double-stranded RNA
in Caenorhabditis elegans. Nature 391(6669): 806-811 (1998).
Focke, M., F. Stumpf, et al. Microstructuring of polymer films for sensitive
genotyping by real-time PCR on a centrifugal microfluidic platform. Lab Chip
(2010).
Fortini, M. E., M. P. Skupski, et al. A survey of human disease gene counterparts in
the Drosophila genome. J Cell Biol 150(2): F23-30 (2000).
Franklin, R. E. and R. G. Gosling. Evidence for 2-Chain Helix in Crystalline Structure
of Sodium Deoxyribonucleate. Nature 172(4369): 156-157 (1953).
Frazer, K. A., D. G. Ballinger, et al. A second generation human haplotype map of
over 3.1 million SNPs. Nature 449(7164): 851-861 (2007).
Fredriksson, S., J. Baner, et al. Multiplex amplification of all coding sequences within
10 cancer genes by Gene-Collector. Nucleic Acids Res 35(7): e47 (2007).
Fuller, C. W., L. R. Middendorf, et al. The challenges of sequencing by synthesis. Nat
Biotechnol 27(11): 1013-1023 (2009).
Ge, D., J. Fellay, et al. Genetic variation in IL28B predicts hepatitis C treatmentinduced viral clearance. Nature 461(7262): 399-401 (2009).
Gerry, N. P., N. E. Witowski, et al. Universal DNA microarray method for multiplex
detection of low abundance point mutations. J Mol Biol 292(2): 251-262 (1999).
Gharizadeh, B., T. Nordstrom, et al. Long-read pyrosequencing using pure 2'deoxyadenosine-5'-O'-(1-thiotriphosphate) Sp-isomer. Anal Biochem 301(1): 8290 (2002).
Gibbs, R. A., P. N. Nguyen, et al. Detection of single DNA base differences by
competitive oligonucleotide priming. Nucleic Acids Res 17(7): 2437-2448 (1989).
Gnirke, A., A. Melnikov, et al. Solution hybrid selection with ultra-long
oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27(2):
182-189 (2009).
Gunderson, K. L., S. Kruglyak, et al. Decoding randomly ordered DNA arrays.
Genome Res 14(5): 870-877 (2004).
Hardenbol, P., J. Baner, et al. Multiplexed genotyping with sequence-tagged
molecular inversion probes. Nat Biotechnol 21(6): 673-678 (2003).
Hardenbol, P., F. Yu, et al. Highly multiplexed molecular inversion probe genotyping:
over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 15(2):
269-275 (2005).
Hinds, D. A., L. L. Stuve, et al. Whole-genome patterns of common DNA variation in
three human populations. Science 307(5712): 1072-1079 (2005).
Hirschhorn, J. N. and M. J. Daly. Genome-wide association studies for common
diseases and complex traits. Nat Rev Genet 6(2): 95-108 (2005).
Hirschhorn, J. N., P. Sklar, et al. SBE-TAGS: an array-based method for efficient
single-nucleotide polymorphism genotyping. Proc Natl Acad Sci U S A 97(22):
12164-12169 (2000).
Hodges, E., Z. Xuan, et al. Genome-wide in situ exon capture for selective
resequencing. Nat Genet 39(12): 1522-1527 (2007).
66
Holland, P. M., R. D. Abramson, et al. Detection of specific polymerase chain
reaction product by utilizing the 5'----3' exonuclease activity of Thermus
aquaticus DNA polymerase. Proc Natl Acad Sci U S A 88(16): 7276-7280 (1991).
Horner, D. S., G. Pavesi, et al. Bioinformatics approaches for genomics and post
genomics applications of next-generation sequencing. Brief Bioinform 11(2): 181197 (2010).
Hoskins, R. A., A. C. Phan, et al. Single nucleotide polymorphism markers for genetic
mapping in Drosophila melanogaster. Genome Res 11(6): 1100-1113 (2001).
Howell, W. M., M. Jobs, et al. Dynamic allele-specific hybridization. A new method
for scoring single nucleotide polymorphisms. Nat Biotechnol 17(1): 87-88 (1999).
Hultin, E., M. Kaller, et al. Competitive enzymatic reaction to control allele-specific
extensions. Nucleic Acids Res 33(5): e48 (2005).
Jobs, M., W. M. Howell, et al. DASH-2: flexible, low-cost, and high-throughput SNP
genotyping by dynamic allele-specific hybridization on membrane arrays.
Genome Res 13(5): 916-924 (2003).
Kaderali, L., A. Deshpande, et al. Primer-design for multiplexed genotyping. Nucleic
Acids Res 31(6): 1796-1802 (2003).
Karas, M. and F. Hillenkamp. Laser desorption ionization of proteins with molecular
masses exceeding 10,000 daltons. Anal Chem 60(20): 2299-2301 (1988).
Kennedy, G. C., H. Matsuzaki, et al. Large-scale genotyping of complex DNA. Nat
Biotechnol 21(10): 1233-1237 (2003).
Kerem, B., J. M. Rommens, et al. Identification of the cystic fibrosis gene: genetic
analysis. Science 245(4922): 1073-1080 (1989).
Kim, S., M. E. Ulz, et al. Thirtyfold multiplex genotyping of the p53 gene using solid
phase capturable dideoxynucleotides and mass spectrometry. Genomics 83(5):
924-931 (2004).
Klatser, P. R., S. Kuijper, et al. Stabilized, freeze-dried PCR mix for detection of
mycobacteria. J Clin Microbiol 36(6): 1798-1800 (1998).
Korbel, J. O., A. E. Urban, et al. Paired-end mapping reveals extensive structural
variation in the human genome. Science 318(5849): 420-426 (2007).
Kwok, P. Y. Methods for genotyping single nucleotide polymorphisms. Annu Rev
Genomics Hum Genet 2: 235-258 (2001).
Landegren, U., R. Kaiser, et al. A ligase-mediated gene detection technique. Science
241(4869): 1077-1080 (1988).
Lander, E. S., L. M. Linton, et al. Initial sequencing and analysis of the human
genome. Nature 409(6822): 860-921 (2001).
Larsson, C., I. Grundberg, et al. In situ detection and genotyping of individual mRNA
molecules. Nat Methods 7(5): 395-397 (2010).
Leamon, J. H., W. L. Lee, et al. A massively parallel PicoTiterPlate based platform for
discrete picoliter-scale polymerase chain reactions. Electrophoresis 24(21): 37693777 (2003).
Lee, L. G., C. R. Connell, et al. Allelic discrimination by nick-translation PCR with
fluorogenic probes. Nucleic Acids Res 21(16): 3761-3766 (1993).
Levene, M. J., J. Korlach, et al. Zero-mode waveguides for single-molecule analysis at
high concentrations. Science 299(5607): 682-686 (2003).
Levene, P. The structure of yeast nucleic acid. The Journal of Biological Chemistry
40(2): 415-424 (1919).
Levy, S., G. Sutton, et al. The diploid genome sequence of an individual human. PLoS
Biol 5(10): e254 (2007).
Li, J. B., Y. Gao, et al. Multiplex padlock targeted sequencing reveals human
hypermutable CpG variations. Genome Res 19(9): 1606-1615 (2009).
67
Li, J. B., E. Y. Levanon, et al. Genome-wide identification of human RNA editing
sites by parallel DNA capturing and sequencing. Science 324(5931): 1210-1213
(2009).
Lindroos, K., S. Sigurdsson, et al. Multiplex SNP genotyping in pooled DNA samples
by a four-colour microarray system. Nucleic Acids Res 30(14): e70 (2002).
Liu, P. and R. A. Mathies. Integrated microfluidic systems for high-performance
genetic analysis. Trends Biotechnol 27(10): 572-581 (2009).
Liu, W. M., X. Di, et al. Algorithms for large-scale genotyping microarrays.
Bioinformatics 19(18): 2397-2403 (2003).
Livak, K. J., S. J. Flood, et al. Oligonucleotides with fluorescent dyes at opposite ends
provide a quenched probe system useful for detecting PCR product and nucleic
acid hybridization. PCR Methods Appl 4(6): 357-362 (1995).
Lovmar, L., A. Ahlford, et al. Silhouette scores for assessment of SNP genotype
clusters. BMC Genomics 6(1): 35 (2005).
Lovmar, L., M. Fredriksson, et al. Quantitative evaluation by minisequencing and
microarrays reveals accurate multiplexed SNP genotyping of whole genome
amplified DNA. Nucleic Acids Res 31(21): e129 (2003).
Lovmar, L. and A. C. Syvänen. Multiple displacement amplification to create a longlasting source of DNA for genetic studies. Hum Mutat 27(7): 603-614 (2006).
Lyamichev, V., A. L. Mast, et al. Polymorphism identification and quantitative
detection of genomic DNA by invasive cleavage of oligonucleotide probes. Nat
Biotechnol 17(3): 292-296 (1999).
Macdonald, S. J., T. Pastinen, et al. A low-cost open-source SNP genotyping platform
for association mapping applications. Genome Biol 6(12): R105 (2005).
Mamanova, L., A. J. Coffey, et al. Target-enrichment strategies for next-generation
sequencing. Nat Methods 7(2): 111-118 (2010).
Manolio, T. A., F. S. Collins, et al. Finding the missing heritability of complex
diseases. Nature 461(7265): 747-753 (2009).
Mardis, E. R. The impact of next-generation sequencing technology on genetics.
Trends Genet 24(3): 133-141 (2008).
Mardis, E. R. Next-generation DNA sequencing methods. Annu Rev Genomics Hum
Genet 9: 387-402 (2008).
Margulies, M., M. Egholm, et al. Genome sequencing in microfabricated high-density
picolitre reactors. Nature 437(7057): 376-380 (2005).
Martin, S. G., K. C. Dobi, et al. A rapid method to map mutations in Drosophila.
Genome Biol 2(9): RESEARCH0036 (2001).
Matsuzaki, H., H. Loi, et al. Parallel genotyping of over 10,000 SNPs using a oneprimer assay on a high-density oligonucleotide array. Genome Res 14(3): 414-425
(2004).
Mattick, J. S., R. J. Taft, et al. A global view of genomic information--moving beyond
the gene and the master regulator. Trends Genet 26(1): 21-28 (2010).
Maxam, A. M. and W. Gilbert. A new method for sequencing DNA. Proc Natl Acad
Sci U S A 74(2): 560-564 (1977).
Melamede, R. J. (1985). Automatable process for sequencing nucleotide US, New
York Medical College (Valhalla, NY)
Metzker, M. L. Sequencing technologies — the next generation. Nature Reviews
Genetics 11(1): 31-46 (2010).
Michael, K. L., L. C. Taylor, et al. Randomly ordered addressable high-density optical
sensor arrays. Anal Chem 70(7): 1242-1248 (1998).
Milani, L., M. Gupta, et al. Allelic imbalance in gene expression as a guide to cisacting regulatory single nucleotide polymorphisms in cancer cells
10.1093/nar/gkl1152. Nucl. Acids Res.: gkl1152 (2007).
68
Mitani, Y., A. Lezhava, et al. Rapid SNP diagnostics using asymmetric isothermal
amplification and a new mismatch-suppression technology. Nat Methods 4(3):
257-262 (2007).
Mitra, R. D. and G. M. Church. In situ localized amplification and contact replication
of many individual DNA molecules. Nucleic Acids Res 27(24): e34 (1999).
Moretti, T. R., A. L. Baumstark, et al. Validation of short tandem repeats (STRs) for
forensic usage: performance testing of fluorescent multiplex STR systems and
analysis of authentic and simulated forensic samples. J Forensic Sci 46(3): 647660 (2001).
Moriyama, E. N. and J. R. Powell. Intraspecific nuclear DNA variation in Drosophila.
Mol Biol Evol 13(1): 261-277 (1996).
Mullis, K., F. Faloona, et al. Specific enzymatic amplification of DNA in vitro: the
polymerase chain reaction. Cold Spring Harb Symp Quant Biol 51(Pt 1): 263-273.
(1986).
Need, A. C., D. Ge, et al. A genome-wide investigation of SNPs and CNVs in
schizophrenia. PLoS Genet 5(2): e1000373 (2009).
Neely, G. G., K. Kuba, et al. A global in vivo Drosophila RNAi screen identifies
NOT3 as a conserved regulator of heart function. Cell 141(1): 142-153 (2010).
Nejentsev, S., N. Walker, et al. Rare variants of IFIH1, a gene implicated in antiviral
responses, protect against type 1 diabetes. Science 324(5925): 387-389 (2009).
Nilsson, M., H. Malmgren, et al. Padlock probes: circularizing oligonucleotides for
localized DNA detection. Science 265(5181): 2085-2088 (1994).
Nyren, P., B. Pettersson, et al. Solid phase DNA minisequencing by an enzymatic
luminometric inorganic pyrophosphate detection assay. Anal Biochem 208(1):
171-175 (1993).
Ohnishi, Y., T. Tanaka, et al. A high-throughput SNP typing system for genome-wide
association studies. J Hum Genet 46(8): 471-477 (2001).
Ohno, K., K. Tachikawa, et al. Microfluidics: applications for analytical purposes in
chemistry and biochemistry. Electrophoresis 29(22): 4443-4453 (2008).
Okou, D. T., K. M. Steinberg, et al. Microarray-based genomic selection for highthroughput resequencing. Nat Methods 4(11): 907-909 (2007).
Oliphant, A., D. L. Barker, et al. BeadArray technology: enabling an accurate, costeffective approach to high-throughput genotyping. Biotechniques Suppl: 56-58,
60-51 (2002).
Olivier, M. The Invader assay for SNP genotyping. Mutat Res 573(1-2): 103-110
(2005).
Pal-Bhadra, M., U. Bhadra, et al. Cosuppression in Drosophila: gene silencing of
Alcohol dehydrogenase by white-Adh transgenes is Polycomb dependent. Cell
90(3): 479-490 (1997).
Pang, A. W., J. R. MacDonald, et al. Towards a comprehensive structural variation
map of an individual human genome. Genome Biol 11(5): R52 (2010).
Parks, A. L., K. R. Cook, et al. Systematic generation of high-resolution deletion
coverage of the Drosophila melanogaster genome. Nat Genet 36(3): 288-292
(2004).
Pastinen, T., A. Kurg, et al. Minisequencing: a specific tool for DNA analysis and
diagnostics on oligonucleotide arrays. Genome Res 7(6): 606-614. (1997).
Pastinen, T., J. Partanen, et al. Multiplex, fluorescent, solid-phase minisequencing for
efficient screening of DNA sequence variation. Clin Chem 42(9): 1391-1397.
(1996).
Pastinen, T., M. Raitio, et al. A system for specific, high-throughput genotyping by
allele-specific primer extension on microarrays. Genome Res 10(7): 1031-1042.
(2000).
69
Pease, A. C., D. Solas, et al. Light-generated oligonucleotide arrays for rapid DNA
sequence analysis. Proc Natl Acad Sci U S A 91(11): 5022-5026 (1994).
Petitjean, A., M. I. Achatz, et al. TP53 mutations in human cancers: functional
selection and impact on cancer prognosis and outcomes. Oncogene 26(15): 21572165 (2007).
Petitjean, A., E. Mathe, et al. Impact of mutant p53 functional properties on TP53
mutation patterns and tumor phenotype: lessons from recent developments in the
IARC TP53 database. Hum Mutat 28(6): 622-629 (2007).
Pop, M. and S. L. Salzberg. Bioinformatics challenges of new sequencing technology.
Trends Genet 24(3): 142-149 (2008).
Porreca, G. J., K. Zhang, et al. Multiplex amplification of large sets of human exons.
Nat Methods 4(11): 931-936 (2007).
Prober, J. M., G. L. Trainor, et al. A system for rapid DNA sequencing with
fluorescent chain-terminating dideoxynucleotides. Science 238(4825): 336-341
(1987).
R Development Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing. ISBN 3-900051-07-0 (2010).
Redon, R., S. Ishikawa, et al. Global variation in copy number in the human genome.
Nature 444(7118): 444-454 (2006).
Reiter, L. T., L. Potocki, et al. A systematic analysis of human disease-associated gene
sequences in Drosophila melanogaster. Genome Res 11(6): 1114-1125 (2001).
Riordan, J. R., J. M. Rommens, et al. Identification of the cystic fibrosis gene: cloning
and characterization of complementary DNA. Science 245(4922): 1066-1073
(1989).
Ronaghi, M., S. Karamohamed, et al. Real-time DNA sequencing using detection of
pyrophosphate release. Anal Biochem 242(1): 84-89 (1996).
Ronaghi, M., M. Uhlen, et al. A sequencing method based on real-time
pyrophosphate. Science 281(5375): 363, 365 (1998).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65
(1987).
Ryder, E., F. Blows, et al. The DrosDel collection: a set of P-element insertions for
generating custom chromosomal aberrations in Drosophila melanogaster.
Genetics 167(2): 797-813 (2004).
Sachidanandam, R., D. Weissman, et al. A map of human genome sequence variation
containing 1.42 million single nucleotide polymorphisms. Nature 409(6822): 928933 (2001).
Sanger, F., S. Nicklen, et al. DNA sequencing with chain-terminating inhibitors. Proc
Natl Acad Sci U S A 74(12): 5463-5467 (1977).
Schmidtke, J., B. Pape, et al. Restriction fragment length polymorphisms at the human
parathyroid hormone gene locus. Hum Genet 67(4): 428-431 (1984).
Schnorrer, F., I. Kalchhauser, et al. The transmembrane protein Kon-tiki couples to
Dgrip to mediate myotube targeting in Drosophila. Dev Cell 12(5): 751-766
(2007).
Sebat, J., B. Lakshmi, et al. Large-scale copy number polymorphism in the human
genome. Science 305(5683): 525-528 (2004).
Shendure, J. and H. Ji. Next-generation DNA sequencing. Nat Biotechnol 26(10):
1135-1145 (2008).
Shendure, J., G. J. Porreca, et al. Accurate multiplex polony sequencing of an evolved
bacterial genome. Science 309(5741): 1728-1732 (2005).
Sherry, S. T., M. H. Ward, et al. dbSNP: the NCBI database of genetic variation.
Nucleic Acids Res 29(1): 308-311 (2001).
70
Shumaker, J. M., A. Metspalu, et al. Mutation detection by solid phase primer
extension. Hum Mutat 7(4): 346-354 (1996).
Smith, L. M., J. Z. Sanders, et al. Fluorescence detection in automated DNA sequence
analysis. Nature 321(6071): 674-679 (1986).
Southern, E. M., U. Maskos, et al. Analyzing and comparing nucleic acid sequences
by hybridization to arrays of oligonucleotides: evaluation using experimental
models. Genomics 13(4): 1008-1017 (1992).
Spradling, A. C., D. Stern, et al. The Berkeley Drosophila Genome Project gene
disruption project: Single P-element insertions mutating 25% of vital Drosophila
genes. Genetics 153(1): 135-177 (1999).
St Johnston, D. The art and design of genetic screens: Drosophila melanogaster. Nat
Rev Genet 3(3): 176-188 (2002).
Stankiewicz, P. and J. R. Lupski. Structural variation in the human genome and its
role in disease. Annu Rev Med 61: 437-455 (2010).
Steemers, F. J. and K. L. Gunderson. Whole genome genotyping technologies on the
BeadArray platform. Biotechnol J 2(1): 41-49 (2007).
Summerer, D. Enabling technologies of genomic-scale sequence enrichment for
targeted high-throughput sequencing. Genomics 94(6): 363-368 (2009).
Summerer, D., D. Hevroni, et al. A flexible and fully integrated system for
amplification, detection and genotyping of genomic DNA targets based on
microfluidic oligonucleotide arrays. N Biotechnol 27(2): 149-155 (2010).
Summerer, D., N. Schracke, et al. Targeted high throughput sequencing of a cancerrelated exome subset by specific sequence capture with a fully automated
microarray platform. Genomics 95(4): 241-246 (2010).
Syvänen, A. C. Accessing genetic variation: genotyping single nucleotide
polymorphisms. Nat Rev Genet 2(12): 930-942 (2001).
Syvänen, A. C. Toward genome-wide SNP genotyping. Nat Genet 37 Suppl: S5-10
(2005).
Syvänen, A. C., K. Aalto-Setala, et al. A primer-guided nucleotide incorporation assay
in the genotyping of apolipoprotein E. Genomics 8(4): 684-692 (1990).
Syvänen, A. C., A. Sajantila, et al. Identification of individuals by analysis of biallelic
DNA markers, using PCR and solid-phase minisequencing. Am J Hum Genet
52(1): 46-59. (1993).
Tabor, S. and C. C. Richardson. A single residue in DNA polymerases of the
Escherichia coli DNA polymerase I family is critical for distinguishing between
deoxy- and dideoxyribonucleotides. Proc Natl Acad Sci U S A 92(14): 6339-6343
(1995).
Tam, G. W., R. Redon, et al. The role of DNA copy number variation in
schizophrenia. Biol Psychiatry 66(11): 1005-1012 (2009).
Teague, B., M. S. Waterman, et al. High-resolution human genome structure by
single-molecule analysis. Proc Natl Acad Sci U S A 107(24): 10848-10853
(2010).
Tewhey, R., J. B. Warner, et al. Microdroplet-based PCR enrichment for large-scale
targeted sequencing. Nat Biotechnol 27(11): 1025-1031 (2009).
The International HapMap Consortium, P. A haplotype map of the human genome.
Nature 437(7063): 1299-1320 (2005).
Tjio, J. and A. Levan. The chromosome number of man. Hereditas 42: 1-6 (1956).
Toth, G., Z. Gaspari, et al. Microsatellites in different eukaryotic genomes: survey and
analysis. Genome Res 10(7): 967-981 (2000).
Turner, E. H., S. B. Ng, et al. Methods for genomic partitioning. Annu Rev Genomics
Hum Genet 10: 263-284 (2009).
71
Tweedie, S., M. Ashburner, et al. FlyBase: enhancing Drosophila Gene Ontology
annotations. Nucleic Acids Res 37(Database issue): D555-559 (2009).
Tyagi, S., D. P. Bratu, et al. Multicolor molecular beacons for allele discrimination.
Nat Biotechnol 16(1): 49-53. (1998).
Tyagi, S. and F. R. Kramer. Molecular beacons: probes that fluoresce upon
hybridization. Nat Biotechnol 14(3): 303-308 (1996).
Vallone, P. M. and J. M. Butler. AutoDimer: a screening tool for primer-dimer and
hairpin structures. Biotechniques 37(2): 226-231 (2004).
Venken, K. J. and H. J. Bellen. Emerging technologies for gene manipulation in
Drosophila melanogaster. Nat Rev Genet 6(3): 167-178 (2005).
Venter, J. C., M. D. Adams, et al. The sequence of the human genome. Science
291(5507): 1304-1351. (2001).
Vos, P., R. Hogers, et al. AFLP: a new technique for DNA fingerprinting. Nucleic
Acids Res 23(21): 4407-4414 (1995).
Wallace, R. B., J. Shaffer, et al. Hybridization of synthetic oligodeoxyribonucleotides
to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Res
6(11): 3543-3557 (1979).
Wang, J., W. Wang, et al. The diploid genome sequence of an Asian individual.
Nature 456(7218): 60-65 (2008).
Watson, J. D. and F. H. Crick. A structure for deoxyribose nucleic acid. Nature
171(4361): 964-967 (1953).
Weigl, B., G. Domingo, et al. Towards non- and minimally instrumented,
microfluidics-based diagnostic devices. Lab Chip 8(12): 1999-2014 (2008).
Werner, J. H., H. Cai, et al. Progress towards single-molecule DNA sequencing: a one
color demonstration. J Biotechnol 102(1): 1-14 (2003).
Wheeler, D. A., M. Srinivasan, et al. The complete genome of an individual by
massively parallel DNA sequencing. Nature 452(7189): 872-876 (2008).
White, R. A., 3rd, P. C. Blainey, et al. Digital PCR provides sensitive and absolute
calibration for high throughput sequencing. BMC Genomics 10: 116 (2009).
Whitesides, G. M. The origins and the future of microfluidics. Nature 442(7101): 368373 (2006).
Wu, D. Y., L. Ugozzoli, et al. Allele-specific enzymatic amplification of beta-globin
genomic DNA for diagnosis of sickle cell anemia. Proc Natl Acad Sci U S A
86(8): 2757-2760 (1989).
72