“It will take years, perhaps decades, to construct a detailed theory
that explains how DNA, RNA and the epigenetic machinery all fit
into an interlocking, self- regulating system. But there is no longer
any doubt that a new theory is needed to replace the central dogma
that has been the foundation of molecular genetics and biotechnology since the 1950s.”
— W. Wayt Gibbs
To my family
List of Papers
This thesis is based on the following papers, which are referred to in the text
by their Roman numerals.
I
†Kyriakopoulou, C., †Larsson, P., †Liu, L., †Schuster, J., Söderbom, F.,
Kirsebom, L. A., Virtanen, A. (2006) U1-like snRNAs lacking complementarity to canonical 5’ splice sites. RNA, 12(9):1603-1611
II Conze, L. L., Green, R. E., Meyer, M., Briggs, A., Schuster, J., Virtanen, A. (2012) Construction and deep sequencing of cDNA libraries
representing medium-sized RNA. Manuscript.
III Conze, L. L., Green, R. E., Larsson, P., Schuster, J., Meyer, M., Briggs,
A., Virtanen, A. (2012) The primate spliceosomal snRNA transcriptome
as revealed by deep sequencing. Manuscript.
†These authors contributed equally.
Reprints were made with permission from the respective publishers.
Contents
INTRODUCTION ........................................................................................ 11 Evolution of the gene concept ................................................................. 11 Current definition of a gene ..................................................................... 13 Problematic issues in gene definition ...................................................... 13 Regulatory elements............................................................................ 13 Splicing ............................................................................................... 14 Non-coding RNAs .............................................................................. 15 Noise in biological systems ................................................................ 16 Remarks on gene identification ............................................................... 18 AIM OF THE STUDY ................................................................................. 19 Noise in biological systems, trash or treasure?........................................ 19 Non-protein-coding RNAs in evolution .................................................. 20 SPLICEOSOMAL SNRNAS ....................................................................... 21 Components of spliceosomal snRNAs .................................................... 21 The roles of spliceosomal snRNAs in splicing ........................................ 22 Splice site recognition ......................................................................... 22 Catalysis of splicing ............................................................................ 23 APPLICATION OF NEXT-GENERATION SEQUENCING .................... 26 Next-generation sequencing technologies ............................................... 26 RNA sequencing: opportunities and challenges ...................................... 27 PRESENT INVESTIGATIONS .................................................................. 28 The estimated number of spliceosomal snRNA genes (I) ....................... 28 The biochemical approach to identification of spliceosomal snRNA
variants (I)................................................................................................ 29 The high-throughput approach in profiling the spliceosomal snRNA
populations (II, III) .................................................................................. 30 Development of the spliceosomal snRNA cDNA library preparation
method for 454 sequencing (II)........................................................... 31 Spliceosomal snRNA profiling (III) ................................................... 31 CONCLUDING REMARKS ....................................................................... 34 SVENSK SAMMANFATTNING ............................................................... 36 ACKNOWLEDGEMENTS ......................................................................... 39 REFERENCES ............................................................................................. 41 Abbreviations
A
AS
BLAST
BLAT
BPS
C
cDNA
DNA
ENCODE
G
HGNC
I
NGS
RNA
lncRNA
miRNA
mRNA
ncRNA
piRNA
rRNA
siRNA
snoRNA
snRNA
tRNA
RNase P
SNP
SS
U
U snRNA
adenosine
alternative splicing
the Basic Local Alignment Search Tool
the BLAST-Like Alignment Tool
branch point sequences
cytidine
complementary DNA
deoxyribonucleic acid
the ENCyclopedia Of DNA Elements
guanosine
the HUGO Gene Nomenclature Committee
inosine
next-generation sequencing
ribonucleic acid
long non-coding RNA
micro RNA
messenger RNA
non-coding RNA
piwi-interacting RNA
ribosomal RNA
small interfering RNA
small nucleolar RNA
small nuclear RNA
transfer RNA
ribonuclease P
single-nucleotide polymorphism
splice site
uridine
uridine-rich small nuclear RNA (spliceosomal snRNA)
INTRODUCTION
The gene has been an essential concept in biology for more than a century.
The view of a gene has changed along with the big breakthroughs in the life
sciences. The evolution of the gene concept from simply being inheritance
material to something complicated and confusing reflects the ever-expanding
knowledge and discoveries in the field.
Evolution of the gene concept
The term gene derives from Greek genesis (“birth”) or genos (“origin”). It
was first introduced by Wilhelm Johannsen in 1909 in his book Elemente der
exakten Erblichkeitslehre (Johannsen 1909). However, the development of
the gene concept already dates back to the 1860s with Gregor Mendel’s
study of crossing pea plants (Pisum sativum), in which he demonstrated that
the “inheritable factor” for traits such as flower color or seed shape did not
appear deleted or “blended” in their offspring. Rather, the traits were passed
to the next generation of pea plants as intact and distinct entities (Mendel
1866).
In the beginning of the 1910s, the chromosome theory of inheritance began to take form (Sutton 1903; Boveri 1904) and Gregor Mendel’s “inheritable factors” were assigned physical locations. The first solid evidence that
associated a gene with a distinct locus on a chromosome was from the work
of the American geneticist Thomas Hunt Morgan and his students. They
demonstrated that the inheritance of a mutation in the eye-color of fruit flies
(Drosophila melanogaster) is correlated with the transmission of the X sex
chromosome (Morgan 1910).
In 1940s, the “one-gene, one-enzyme” or later “one gene, one protein”
hypothesis was developed. It is the idea that each gene is responsible for
producing a single, specific enzyme or functional protein in the cell. This
concept was first proposed by George Beadle and Edward Tatum from a
metabolism study on Neurospora crassa. They discovered that mutations in
the genes could affect steps in the metabolic pathways (Beadle and Tatum
1941).
Many great discoveries between the 1920s and 1950s contributed to the
definition of the gene as a physical molecule. Hermann Müller first demonstrated that X-ray radiation could cause mutations of the gene (Muller 1927).
11
Then Griffith’s experiment (Griffith 1928) provided further evidence by
showing that something in the dead pathogenic Streptococcus pneumonia
could be taken up by live non-pathogenic Streptococcus pneumonia which
rendered them pathogenic. This process is known as transformation. Later,
Avery, MacLeod and McCary showed that the genetic material could be
digested by the enzyme DNase, suggesting that transformation is induced by
DNA (Avery et al. 1944). In 1952, Hershey and Chase determined that
DNA, not protein, was the genetic material (Hershey and Chase 1952). Finally, the solution of the DNA double helix structure by Watson and Crick in
1953 (Watson and Crick 1953) explained how DNA functioned as the molecule for heredity.
Since then, the field of molecular biology has progressed rapidly. In 1958,
Francis Crick first proposed the concept known as the “Central Dogma” of
molecular biology (Figure 1). The Central Dogma states that the flow of
genetic information is from nucleic acid (DNA or RNA) to protein (Crick
1958; Crick 1970). The definition of a gene started to shift towards one that
viewed it as a transcribed code. Several important works that contributed to
this definition need to be mentioned here. The first is the theory developed
by Jacob and Monod about the operon in bacterial gene control (Jacob et al.
1960). The second is the deciphering of the first genetic code by Marshall
Nirenberg and his colleagues (Nirenberg and Matthaei 1961). The last is the
determination of the first nucleotide sequence of a gene, i.e. the bacteriophage MS2 coat protein (Min Jou et al. 1972).
Figure 1. The central dogma of molecular biology. This diagram shows the flow of
the genetic information described by Crick in 1958.
The discovery of introns and the mechanism of RNA splicing by Phillip
Sharp and Richard Roberts (Berget et al. 1977; Chow et al. 1977), together
with the development of cloning and sequencing techniques, have greatly
expanded the knowledge about gene organization and expression. The gene
definition has thus broadened to include its role as an annotated open reading
frame (ORF) in the genome (Doolittle 1986).
12
Current definition of a gene
According to the HUGO Gene Nomenclature Committee (HGNC), a gene is
“a DNA segment that contributes to phenotype/function. In the absence of
demonstrated function, a gene may be characterized by sequence, transcription or homology” (Wain et al. 2002). This definition of the gene has become prevalent because of events such as the human genome draft, the development of next-generation sequencing technologies and the rise of bioinformatics, especially in gene prediction. This current definition of a gene is
still based on the sequence view and is used for genome annotation by scientific organizations, such as the ENCODE project (ENCODE Project
Consortium 2004).
Problematic issues in gene definition
However, there are many problematic issues with the current definition of a
gene (Gerstein et al. 2007). They can be summarized into the following
groups:
Regulatory elements
The expression of a gene is heavily regulated by many factors. These factors
can be DNA, RNA or protein.
The typical DNA sequences that regulate gene expression are promoters
and operators. Promoters reside in the 5’ flanking region of the coding sequence and recruit RNA polymerases to initiate transcription. A classic example of an operator was shown by Jacob and Monod in the lac operon of
Escherichia coli, where the operator lies between the promoter and the genes
of the operon, and controls the access of RNA polymerase to the genes
(Jacob et al. 1960). There are also DNA regulatory elements that reside either within the coding sequences or distant from the coding sequences. Enhancers are an example of the latter. These regulatory elements bind with a
set of proteins that enhance the transcriptional level of the gene. The enhancers sometimes do not even need to be located on the same chromosome as
the gene that it acts on (Spilianakis et al. 2005).
The role of protein as factor in gene regulation has been widely recognized and well studied. Recently non-coding RNAs (ncRNAs) have also
been shown to be implicated in gene expression control. Among them are the
small non-coding RNAs (~20 nucleotides), such as small interfering RNAs
(siRNA), microRNAs (miRNA) and Piwi-interacting RNAs (piRNA). These
bind to target messenger RNAs (mRNAs) via base pairing, resulting in translational repression, mRNA degradation and gene silencing (Hamilton and
Baulcombe 1999; Bartel 2009). However, the long non-coding RNAs
13
(lncRNA), which have more than 200 nucleotides, are not as well studied as
small non-coding RNAs. Those that are well-characterized are those that
regulate gene expression by diverse mechanisms (Wang and Chang 2011),
e.g., transcriptional gene silencing via regulation of the chromatin structure
(Morris 2009; Whitehead et al. 2009).
Other current definitions of the gene have taken the regulation elements
into consideration. For example, in one textbook, the gene is described in
molecular terms as “the entire DNA sequence—including exons, introns, and
noncoding transcription control regions—necessary for production of a functional protein or RNA.” (Lodish et al. 2000). If the definition of a gene includes all the regulatory elements involved in the production of proteins/RNAs, then the content of a gene becomes far more complicated than
just DNA sequences with coding and flanking control regions—it has to
include distant DNA regulatory elements as well as non-coding RNAs and
protein factors.
Splicing
The discovery of splicing in 1977 (Berget et al. 1977; Chow et al. 1977)
showed that eukaryotes have a far more complicated genetic organization
and information flow than prokaryotes (Figure 2). A gene is not simply
composed of a protein-coding region with flanking control regions; it also
has many protein-coding regions (exons) which are separated by long nonprotein-coding regions (introns).
Figure 2. The flow of genetic information in eukaryotes and prokaryotes. (A) Eukaryotes. (B) Prokaryotes.
14
Alternative splicing
During RNA splicing, the exons can be connected in multiple ways and even
the exon sequences can be re-defined (Figure 3). This process is called alternative splicing. The prevalence of alternative splicing is closely correlated
with the eukaryotic evolutionary tree (Keren et al. 2010). In humans, all the
genes that contain multiple exons can give rise to alternatively spliced
mRNAs (Pan et al. 2008; Wang et al. 2008).
Figure 3. Alternative splicing patterns. (A) Cassette-exon inclusion or skipping. (B)
Alternative 5’ or 3’ splice site. (C) Intron retention.
Alternative splicing can generate several products from one genetic locus,
which means that the information in DNA is not linearly related to that on
the protein. For instance, when alternative splicing has shifted the reading
frame, the same gene locus can produce proteins with no amino acid sequences in common.
Trans-splicing
Normally the splicing process only involves a single RNA molecule and is
referred to as cis-splicing. However there are examples in eukaryotes where
exons from different RNA transcripts are joined together (Lasda and
Blumenthal 2011). This process is called trans-splicing and it immediately
challenged the classical concept of the gene as “a locus”.
Non-coding RNAs
Non-coding RNAs (ncRNAs) are transcripts that do not code for proteins.
They make up 97-98% of the transcriptional output of the human genome
(Mattick 2001). However, the known ncRNAs make up only a small portion
of the total population of non-protein-coding RNA transcripts. Their functions are very diverse and include gene regulation (e.g., miRNAs), splicing
(U snRNAs), RNA processing (e.g., snoRNAs) and protein synthesis
(tRNAs and rRNAs) (Mattick and Makunin 2006).
15
The existence of ncRNA instantly created an exception to the central
dogma of molecular biology that states that genetic information flows from
DNA to RNA to protein. It is now clear that, at least in higher organisms, it
is incorrect to assume that most genetic information is expressed as proteins.
Interestingly, there is evidence that the number of ncRNAs correlates with
the complexity of an organism, i.e. the more complex the organism, the more
ncRNAs it possesses (Taft et al. 2009). In addition, the distribution of
ncRNA genes in the genome and their overlap with protein-coding genes
have also collided with the one-gene, one-locus concept.
Noise in biological systems
The biological systems in a cell are intricate and complex, and though accurate, never perfect. Hence gene or gene expression variations within populations or even cell-to-cell can be observed. In addition, the cell also retains
large numbers of DNA sequences that do not have any obvious functions.
All of these phenomena are considered as “noise” in the biological systems
of the cell and they raise different problematic issues regarding the gene
definition.
Genetic variation
Genetic variation occurs on many different scales, ranging from gross alterations in the karyotype to single nucleotide changes. Single nucleotide changes that occur in at least 1% of the population are called single-nucleotide
polymorphisms (SNPs). SNPs occur in most regions of the genome, including coding/non-coding regions and intergenic regions. Even though they are
neutral in most cases, some can affect gene expression seriously by producing different proteins or suppressing the production of the original proteins.
The copy number of genes can also differ between individuals (Sebat et al.
2004). This implies that the genetic elements may not be constant in their
location and number. Epigenetic modification, e.g. DNA methylation and
histone modifications, is another type of genetic variation; in these modifications phenotype is not determined strictly by genotype (Portela and Esteller
2010).
Transcriptional variation
The majority of transcriptional variation is caused by alternative splicing
events, where different isoforms of mRNA are produced in individuals
(Wang et al. 2008). RNA editing (the process in which the information content of an RNA molecule is altered through a chemical change in the base)
can generate variation as well (Sie and Kuchka 2011). RNA editing has been
observed in most RNA molecules of eukaryotes, including tRNAs, rRNAs,
mRNAs and miRNAs. The most frequent RNA editing events are cytidine
(C) to uridine (U) and adenosine (A) to inosine (I) deaminations.
16
During transcription, RNA polymerases may make mistakes resulting in
incorrect initiation, misreading and stuttering. These incorrectly generated
RNA molecules are referred to as transcriptional noise, which is known to
play an important role in the diversity of genetically identical cells and organisms (Ramsey et al. 2006).
Transcriptional variation generates multiple products of proteins/RNAs
from one genetic locus. These products sometimes do not contain the same
information as the DNA sequence, which implies that the information on the
DNA is not always preserved 100% in RNA sequences.
Translational variations
Some post-translational events, including protein splicing and protein modification, produce proteins without directly receiving the information from
the DNA (Villa-Komaroff et al. 1975; Wold 1981).
Junk DNA
The term Junk DNA (Ohno 1972) is used for describing the portions of a
genome sequence for which a function is not known. Junk DNA received
attention after the sequencing of the human genome, where it was shown that
only 2.2% of the genome codes for functional products (Frith et al. 2005).
The human genome does not have many more protein-coding genes
(~20,000) (International Human Genome Sequencing Consortium 2004) than
the nematode worm (~19,000) (Stein et al. 2003). Therefore, it is likely that
the developmental complexity of humans resides in the 98% “junk DNA” in
our genome (Pheasant and Mattick 2007).
The concept of “junk DNA” is now undergoing revision. It has been discovered that much of the junk DNA actually is transcribed as non-coding
RNA. These non-coding RNAs have important functions in the cell
(Willingham and Gingeras 2006; Ponicsan et al. 2010; Pink et al. 2011).
Pseudogenes and retrogenes
Pseudogenes i.e. dysfunctional copies of their relative genes have also long
been considered to be “junk” in the genome. However, recent studies have
shown that many pseudogenes are transcribed and some even exhibit tissue
specific patterns (Pink et al. 2011).
A retrogene is formed via reverse transcription of the mature RNA product from its parent genes after which the DNA product is inserted into the
genome (Vanin et al. 1980). The genetic information flows from RNA to
DNA, in contradiction with both the definition of a gene and the central
dogma of molecular biology.
17
Remarks on gene identification
Two important questions should be considered while identifying genes. They
are: “What is a gene?” and “What are experimental artifacts?”. There is still
no clear answer to the first question, but an awareness of the problematic
issues surrounding the definition is a good start for the planning of gene
identification. Nowadays, a combination of both computational and biochemical approaches is often used in gene identification. Since many different methods are available for the two approaches, it is beneficial if the potential experimental artifacts and the feasibility of the methods are analyzed
thoroughly before actually conducting the experiment.
Furthermore, keeping these two questions in mind facilitates the analysis
of the data so that any conclusions about the newly identified genes are unbiased and accurate.
18
AIM OF THE STUDY
After the genetic code was deciphered, it was in hindsight naively believed
that a clear answer regarding cell activities would emerge once the whole
human genome was sequenced. However, it became clear from the sequenced human genome that the cell system is far more complicated and
complex than ever imagined. We are just standing at the beginning of understanding complex network.
One of the biggest surprises revealed by sequencing the human genome is
the vast number of non-protein-coding RNAs (ncRNAs) in the cell. It has
been estimated that they make up ~98% of the total transcriptional outputs in
the human genome (Mattick 2001). They are expressed from the loci that
were long considered “junk” in the genome. However, more and more evidence shows functionalities of these non-protein-coding RNAs and their
number seems to correlate with the complexity of an organism (Taft et al.
2009). It has become clear that ncRNAs play an important regulatory role in
the process of protein expression.
Another surprise is that humans do not have many more protein-coding
genes than a nematode worm. It turns out that the diversity of proteins is due
to a process called alternative splicing by which multiple protein products
can be generated from one mRNA transcript or one gene locus. Recent studies indicate that almost all human pre-mRNAs with multiple exons give rise
to alternatively spliced mRNAs (Pan et al. 2008; Wang et al. 2008). Alternative splicing is believed to be the main mechanism in expanding the eukaryotic proteome.
Although much effort has been put into detecting functional non-proteincoding RNAs and understanding the mechanisms behind alternative splicing,
much still remains unknown. Spliceosomal snRNAs are a type of non-coding
RNA that directly influences alternative splicing by participating in the
splicing catalytic center. By studying snRNAs, we hope to contribute to the
appreciation of ncRNAs and their subsequent roles in speciation.
Noise in biological systems, trash or treasure?
Both the bona fide gene loci and RNA sequences of spliceosomal snRNAs
are known, but many previous studies have shown heterogeneity within the
spliceosomal snRNA populations (Manser and Gesteland 1982; Monstein et
19
al. 1983; Westin et al. 1984; Lund 1988; Sontheimer and Steitz 1992). This
heterogeneity is thought to arise from the “noise” in biological systems, such
as SNPs, RNA editing, RNA polymerase errors, pseudogenes and retrogenes. However, the function of the “noise” has always been debated.
In our study, a combination of both biochemical and computational approaches was used to profile spliceosomal snRNAs. Through our analysis we
hope to provide a more balanced view of the meaning of noise in biological
systems.
Non-protein-coding RNAs in evolution
In humans, non-protein-coding RNAs are not only physically abundant but
also involved in almost every step of gene expression. The fact that their
abundance is tightly correlated with the eukaryotic evolutionary tree and
reflects the complexity of eukaryotes (Taft et al. 2009) indicates their important role during evolution.
By investigating spliceosomal snRNAs in different primates, we hope to
better understand their functionalities in an evolutionary perspective.
20
SPLICEOSOMAL SNRNAS
In eukaryotes, the widespread non-protein-coding sequences can interrupt
the continuation of single protein coding sequences. Hence, newly transcribed pre-mRNAs have to go through a unique process, called splicing, to
precisely remove the non-coding sequences (introns) and join the coding
sequences (exons) before their translation into proteins.
Splicing is carried out by a macromolecular machinery known as the
spliceosome. The assembly of a spliceosome on a pre-RNA involves a set of
five small nuclear RNAs (snRNAs) and more than one hundred proteins
(Jurica and Moore 2003).
The 5’ and 3’ terminal dinucleotides of introns are universally conserved.
More than 99% of introns contain GU at the 5’ end and AG at the 3’ end;
they are referred to as U2-type introns. The U2-type introns are spliced by
the major spliceosome. Other types of introns exist, such as the U12-type.
These contain AU at the 5’ end and AC at the 3’ end. The U12-type introns
are spliced by the minor spliceosome.
The major and minor spliceosomes have many proteins in common, but
only one snRNA. Undoubtedly, the protein components of the spliceosome
play an important and irreplaceable role in the splicing process, but in this
study we only focus on the RNA part of the spliceosome and its subsequent
role in splicing.
Components of spliceosomal snRNAs
The major spliceosome contains U1, U2, U4, U5 and U6 snRNAs. In the
minor spliceosome, the corresponding snRNA components are U11, U12,
U4atac, U5 and U6atac snRNAs (Patel and Steitz 2003). Only the U5 snRNA is shared between the two spliceosomes.
The spliceosomal snRNAs can be divided into two groups on the basis of
their biogenesis pathways (Will and Luhrmann 2001; Patel and Bellini
2008). U1, U2, U4, U5, U11, U12 and U4atac snRNAs belong to one group
and are transcribed by RNA polymerase II. During biogenesis, they are transiently located in the cytoplasm, where they possess a 2,2,7-tri-methylguanosine (m3G) structure at the 5’ end and recruit seven Sm-proteins to a
conserved uridine-rich sequence (Sm-site). The m3G-cap and Sm-site are
important features for this group of spliceosomal snRNAs and they are often
21
referred to as Sm-class snRNAs. The other group is only composed of U6
and U6atac snRNAs. They are transcribed by RNA polymerase III and never
leave the nucleus during their biogenesis. Therefore they lack the features of
Sm-class snRNAs, but instead have a gamma-monomethyl phosphate cap at
the 5’ end (Shimba and Reddy 1994) and a polyuridine tail at the 3’ end.
The equivalent snRNAs in major and minor spliceosomes may differ in
sequence, but not in secondary structure (Figure 4). Indeed, the secondary
structures of spliceosomal snRNAs are highly conserved among all eukaryotes, despite large sequence differences (Marz et al. 2008). Two hypotheses
have been put forward to explain the evolutionary origin of the two spliceosomes. One is the endosymbiotic theory (Burge et al. 1998). It postulates that
a unicellular ancestor of the eukaryote possessed introns and spliceosomes,
which were split into two lineages. Later the two diverged lineages (with
accumulated differences in spliceosomal components and introns) were
fused by endosymbiosis to give rise to the ancestor of present-day eukaryotes. The other hypothesis states that the two spliceosomes have developed
through parallel evolution after duplication (Tarn and Steitz 1997). Since the
analysis of the genomic context of spliceosomal snRNAs reveals that they
behave like mobile elements, the parallel evolution theory seems to best
explain the similarities and differences between the two spliceosomes.
The roles of spliceosomal snRNAs in splicing
The number of snRNAs in a spliceosome is relatively low compared to that
of proteins, but the snRNAs are important because of their role in the splicing process. They participate in splice site selection by base pairing with the
conserved sequences at the 5’ and 3’ terminal of introns. Furthermore, they
recruit the important splicing protein factors to the spliceosome and they
catalyze the trans-esterification reactions of splicing.
Splice site recognition
The 5’ splice site (SS) recognition (Mount et al. 1983; Konarska 1998) is
generally believed to happen through base pairing between U1 snRNA (U11
in the minor spliceosome) and the exon-intron junction sequence of premRNA (Figure 5). Both in vitro and in vivo experiments have shown that the
absence of, or mutations in, the 5’end of U1 snRNA result in suppression or
blockage of the splicing process (Zhuang and Weiner 1986; Hartmann et al.
2010).
The 3’ splice site (SS) recognition also requires the recognition of an upstream conserved polypyrimidine tract and branch point sequences (BPS).
This process involves a set of protein factors, but also requires U2 snRNA
which base pairs with BPS.
22
Figure 4. The secondary structure of human spliceosomal snRNAs. The Sm-binding
sites are marked with grey shading. Modified with permission from (Yu et al. 1999;
Turunen 2012) © 1999 Cold Spring Harbor Laboratory.
Both 5’ and 3’ splice-site recognition steps are essential in splicing, because
they initiate spliceosome assembly and are responsible for intron definition,
splice-site selection and thus determine the sequence of mRNA products.
Catalysis of splicing
The actual splicing is carried out through two trans-esterification reactions
and directly involves three snRNAs: U2, U6 and U5 (U12, U6atac and U5 in
the minor spliceosome) (Figure 5).
23
Figure 5. Spliceosome assembly and catalysis actions. (A) Major spliceosome. (B)
Minor spliceosome. Modified with permission from (Patel and Steitz 2003) © 2003
Nature Publishing Group.
During the trans-esterification reactions, U2 snRNA remains base paired
with BPS. U5 snRNA interacts with nucleotides of the 5’ and 3’ exon and
U6 snRNA base pairs with U2 snRNA as well as the 5’ end of the intron. In
this arrangement, the spliceosomal snRNAs manage to keep both the 5’ and
3’ splice sites and the branch-site adenosine in the proximate location to
facilitate the nucleophilic attacks during the reaction. The first nucleophilic
attack is from the 2’-OH of branch-site adenosine to the phosphate that sepa24
rates the 5’ exon and intron sequences. This results in a cleavage of the 5’
exon and generates the lariat intermediate. The second nucleophilic attack is
from the newly formed 3’-OH group of the 5’ exon to the phosphate that
separates the 3’ exon and intron sequences, thereby generating spliced products (Will and Lührmann 2011).
Many studies have shown that spliceosomal snRNAs alone can catalyze
the required chemical reactions (Tuschl et al. 2001; Valadkhan and Manley
2001; Valadkhan et al. 2007), hence confirming the important roles of snRNAs in the spliceosome and splicing. These studies also suggest that the
spliceosome is in fact a ribozyme.
25
APPLICATION OF NEXT-GENERATION
SEQUENCING
Capillary DNA sequencing (Sanger sequencing) became a standard technology in genetic research as soon as it was first described by Sanger (Sanger et
al. 1977). It has contributed to many monumental accomplishments, including the completion of the human reference genome. However Sanger sequencing is very time consuming and costly if used for genome analysis and
it is considered a ‘first-generation’ technology. In the last five years, several
faster and cheaper high-throughput sequencing technologies have been developed. They have largely replaced Sanger sequencing in high-throughput
applications and fundamentally changed the way genetic experiments are
conducted. These new sequencing methods are referred to as the nextgeneration sequencing (NGS) technologies.
Next-generation sequencing technologies
Several NGS technology platforms are commercially available, such as Illumina/Solexa, Roche/454, Life/APG, Helico BioSciences and Pacific Biosciences, and more are under development (Metzker 2010), see Table 1. All
of them require a complex incorporation of enzymology, chemistry, highresolution optics, hardware, and software engineering.
Table 1. Examples of current next-generation sequencing technologies.
Illumina
Technology
Read length
Capacity
Run time
26
454
Sequencing by Pyrosynthesis
sequencing
SOLiD
Helico
Ligation based Single molecule real time
sequencing
100 bases
700 bases
50 bases
35 bases
<600 Gb / run ~700 Mb / run 10GB/run
Up to 35GB
3 days
10 hours
5 days
8 days
Pacific
Single molecule real time
sequencing
> 1,000 bases
~45 Mb / run
< 1 day
In comparison with Sanger sequencing, these platforms require much shorter
DNA sequencing sample-preparation time for high-throughput applications
and they are capable of parallel registering the sequencing reactions from
each DNA fragment in the library.
RNA sequencing: opportunities and challenges
RNA research has also benefited from the next-generation sequencing technologies. The possibilities to access RNA information through cDNA sequencing on a massive scale (RNA-seq) have revolutionized transcriptomics
(Ozsolak and Milos 2011). It has enabled us to progressively put the puzzle
of trancripts together, both quantitatively and qualitatively.
Several advances enabled by RNA-sequencing methods are worth mentioning. These are the comprehensive understanding of the transcription initiation sites, the ability to catalog antisense transcripts, the improved characterization of alternative splicing patterns and profiling of non-coding RNA.
These advances allow us to examine the diversity of transcription and maybe
solve the question what causes the difference in complexity between organisms. This better understanding of transcriptomes will surely help us to distinguish the functional non-protein-coding sequences from both the genomic
and transcriptional noise.
RNA-sequencing methods have limitations. A major limitation is the dependence on cDNA synthesis. This results in the input RNAs going through
many time-, sample- and labor-consuming manipulations before sequencing.
Many biases are introduced and the absolutely quantitative view of RNAs is
lost.
Emerging technologies that allow direct RNA sequencing may eliminate
the cDNA synthesis-related limitations. Major challenges remain: the capacity to generate multimillion base reads per run, error rate reduction and longer
read lengths.
27
PRESENT INVESTIGATIONS
The completion of the human genome (Lander et al. 2001; Waterston et al.
2002; Schmutz et al. 2004; Kidd et al. 2008; International Human Genome
Sequencing Consortium 2004) and the emergence of next-generation sequencing technologies have generated a tremendous amount of information
related to molecular biology. This achievement would have been impossible
without the rapid development of bioinformatics.
Bioinformatics is a field using computational tools to understand biological processes. Examples include mapping and analyzing DNA/RNA and
protein sequences, modeling DNA/RNA and protein structures. One of the
important contributions from bioinformatics in sequence analysis is gene
prediction and annotation. In this process, new protein-coding genes, regulatory RNA genes and other functional elements are identified through similarity searches using known gene or functional element sequences and structures. Interestingly, many of the newly annotated genes reside in what was
previously considered junk DNA regions. This observation provides strong
support that the content of junk DNA regions needs to be re-analyzed and
large amounts of potential regulatory RNAs exist.
In our study, we combined bioinformatical, biochemical and highthroughput sequencing approaches to profile the human spliceosome snRNA
population. By doing so, our goal was to gain insights into the heterogeneity
of the U snRNAs for an unbiased and accurate view of the diversity and
functionality of spliceosomal snRNAs.
The estimated number of spliceosomal snRNA genes (I)
The human spliceosomal snRNAs are encoded by multiple genes and early
studies revealed the variation among the U snRNA population (Manser and
Gesteland 1982; Monstein et al. 1983; Westin et al. 1984; Lund 1988;
Sontheimer and Steitz 1992). However, due to the lack of human genome
information, none of the early studies could provide a more detailed or intact
view regarding the heterogeneity of spliceosomal snRNAs. Now, with access
to the human genome and computational tools, we are able to estimate the
number of spliceosomal snRNA genes, and hence study their heterogeneity
and functionality.
28
We initiated our study by identifying U1 snRNA genes through computational approaches. We used known U1 snRNA genes (U1A, or RNU1-1) as a
reference and searched for similar sequences in the human genome. A combination of BLAST-like Alignment Tool (BLAT) (Kent 2002) and the RepeatMasker (Smit et al. 1996-2010) was used and we identified in total 188
putative U1 snRNA genes. This number is within the range of that listed in
the Ensemble database from Rfam prediction (131 in v.50 and 145 in v.64).
Later, Larsson and colleagues expanded this study to computationally
identify all spliceosomal snRNAs in different species (unpublished data,
Larsson, Kirsebom & Virtanen). Their results revealed a great heterogeneity
among spliceosomal snRNA genes across vertebrates (Table 2) and supported the hypothesis that many more functional non-protein-coding snRNAs
might participate in regulating protein expression.
Table 2. Number of Ensembl v50 snRNA genes.
Human
Chimpanzee
Orangutan
Rhesus
Mouse
Rat
Horse
Dog
Cow
Opossum
Platypus
Chicken
U1
U2
U4
U5
U6
U11
U12
131
141
133
116
141
124
27
52
60
27
95
15
70a
60a
1a
7a
29
42
24
51
27
19
19
4
62
59
59
49
13
70
4
140
0
11
3
2
20
19
20
22
10
11
7
11
0
7
10
6
906
901
887
799
545
499
290
724
0
376
116
14
5
4
7
4
5
7
2
1
0
1
0
1
2
2
2
2
5
4
1
10
2
0
2
1
U4atac U6atac
15
16
15
11
1
1
2
1
0
1
1
1
37
37
36
31
20
15
16
16
0
48
73
0
a
Numbers extracted from Ensembl v49. (Unpublished data from Larsson, Kirsebom
& Virtanen).
The biochemical approach to identification of
spliceosomal snRNA variants (I)
The computational approach provided us with a good candidate list of snRNA genes. However only some of the genes could be further validated
through a biochemical approach, due to the limited sensitivity of the biochemical approach in distinguishing very similar sequences. After manually
scanning the upstream and downstream sequences for known promoter/enhancer motifs, 3’ processing signals, or Sm-binding sites (Hernandez
and Weiner 1986; Murphy et al. 1987), we finally selected 27 promising U1
snRNA candidate genes for biochemical analysis. Two approaches, Northern
29
blot and cloning, were used in the analysis and confirmed the expression of
three U1 snRNA genes: U1A5, U1A6 and U1A7. They were ubiquitously
expressed in a wide range of human tissues and carried the important features of snRNA, i.e., m3G-caps and Sm-protein binding sites (Lerner and
Steitz 1979).
Two interesting observations were made among the newly identified U1like snRNAs (Figure 1 from paper I). One was that the population could be
further expanded by genetic variation, such as SNPs. In the case of U1A6
snRNAs, a total of 8 variable positions were found. Even though most of the
SNPs seemed to be neutral with no functional effect on U1A6 snRNA, one
was located at an important position that could influence the recognition of
the 5’ splice site. In addition, a stretch of 11 nucleotides was found either to
be present or absent in U1A6 snRNA. This insertion/deletion was long
enough to have an impact on the secondary structure and as a consequence,
on the function of U1A6 snRNA. Another interesting observation of the
newly identified U1-like snRNAs was that they lacked complementarity to
the canonical 5’ splice sites. This implies that they could function by recognizing non-canonical 5’ splice sites.
The evolutionary conservation analysis of these three U1-like snRNAs in
dog, cow and other primates suggests that they evolved from bona fide U1
snRNA genes. This fast evolution of U1 snRNA genes could be linked to
speciation.
Although only three U1-snRNA variants were thoroughly studied via biochemical approaches, the study revealed two important messages regarding
the situation of spliceosomal snRNAs in the cell. Firstly, the U snRNA
populations appear to be far more diverged and abundant than previously
thought and secondly, the encoding genes are evolving rapidly.
The high-throughput approach in profiling the
spliceosomal snRNA populations (II, III)
Our initial computational search and subsequent biochemical analysis encouraged us to expand our study. In order to accomplish that, an appropriate
high-throughput method was necessary.
The first approach we used was a custom-made high-density oligonucleotide microarray from NimbleGen (data unpublished). In this approach, we
designed 3124 specific oligos that covered 594 spliceosomal snRNA variants
found by a computational search. Due to the high similarities of variant sequences, the specificities of the oligos were limited and it was not always
possible to target only a single U snRNA variant. This resulted in a mixed
level of unspecific hybridizations between the oligos and the RNAs, which
led to complications in defining a threshold for background noise during the
30
analysis. Hence, the variant call from the microarray data was difficult to
interpret and inaccurate. We concluded that microarray technology did not
fit our aim and that we needed a high-throughput approach that could ensure
unambiguous discrimination between similar sequences at the single nucleotide level.
Subsequently, the next-generation sequencing technologies (NGS) caught
our attention. Massively sequencing RNAs had become possible through
cDNA libraries (RNA-seq). The immediate advantages of RNA-seq are that:
1) it characterizes the sample independent of a candidate gene list; and 2)
sequencing provides single nucleotide resolution of sequence differences.
Both of these advantages fit well with our study goals. However because the
existing cDNA library preparation methods were either designed for long
RNAs (>200nt) or small RNAs (~20nt), we were forced to develop our own
method to prepare cDNA libraries from spliceosomal snRNAs before we
could apply the NGS technologies.
Development of the spliceosomal snRNA cDNA library
preparation method for 454 sequencing (II)
We chose the 454 technology as the NGS platform for our study because it
generates long read lengths (>200nt) that cover the full length of spliceosomal snRNAs.
Compared to the small RNA cDNA library preparation protocol (Lu and
Shedge 2011) (Figure 1 from paper II), our method provided several advantages (Figure 3 from paper II). First, it efficiently enriched target spliceosomal snRNAs through immunoprecipitation instead of size selection of
RNA through gel purification. Second, it greatly eliminated the biases at
RNA adapter ligation steps, i.e. it removed the potential modification at the
3’ end of RNAs and optimized the ligation reactions. Third, it considerably
reduced the amount of required input RNA by using a beads-based strategy
instead of an electrophoresis-based purification. Finally, it managed to sequence at single-molecule level by eliminating the PCR amplification step.
Even though our method aimed to construct a spliceosomal snRNA
cDNA library for 454 sequencing, it has much broader applications. It can be
used in other NGS platforms such as Illumina and SOLiD and it provides
general strategies for avoiding biases and saving sample, time, and labor
during the cDNA library preparation.
Spliceosomal snRNA profiling (III)
Through bioinformatics analysis, 365 401 sequencing reads were classified
as spliceosomal snRNAs. This number represent 90% of the total reads that
we sequenced directly from eight m3G-capped RNA cDNA libraries. These
31
libraries were prepared from two different protocols and three distinct organ
systems or cell types of human, chimpanzee and rhesus (Table S1 from paper III). All known spliceosomal snRNA families (with the exception of
U6atac, which lacks 5’ m3G-cap) were found in the sequencing reads.
The relative abundance of the spliceosomal snRNAs calculated from our
sequencing read counts was in agreement with previously estimated numbers
(Hodnett and Busch 1968; Weinberg and Penman 1968; Weinberg and
Penman 1969), except for that of U11. We also noticed that the abundance of
some spliceosomal snRNAs changed when different method protocols were
applied, which indicated that they possessed unknown modifications at their
3’ ends (Figure 2 from paper III).
After we filtered away the homopolymer-related sequencing errors, we
identified the bona fide spliceosomal snRNA genes as well as several previously annotated pseudogenes and predicted genes in our genome mapping
step (Table 1 & Table S4 from paper III). However, the total number of
identified variants was much lower than we expected from the bioinformatics search (Table 2&3).
Although no significant diversity in the U snRNA population was observed in the reference genome mapping result, sequence variation was still
a frequent event that we observed in our sequencing reads. Approximately
30% of the 454 sequencing reads contained mismatches with respect to their
genomic loci and could therefore potentially contribute to the heterogeneity
of the spliceosomal snRNAs.
In order to investigate the causes of this large number of sequence variations in our sequencing result, we manually inspected a small sample of the
reads. We re-sequenced around 200 clones from the same cDNA libraries
using cloning and Sanger sequencing technology. In this way we could evaluate the frequency of sequencing errors in the 454 sequencing reads. Again,
~30% of the sequences could not be perfectly mapped to the reference genome. The manual analysis of the mismatches, especially the 33 mismatched
positions in the cloned U1 snRNA population, revealed that the sequence
variations in the sequencing reads were largely due to single nucleotide polymorphisms (SNPs) and partially due to other causes. These included e.g.
posttranscriptional modifications (e.g. A-to-I editing), transcriptional noise,
such as incorrect transcriptional initiation or mis-incorporation of nucleotide
and experimental errors. However, except for SNPs, we couldn’t unambiguously discriminate between the causes described above. The observed 33
mismatched positions in our data set corresponded to a mismatching frequency of ~3.7×10-3. Since 9 out of the 33 mismatches were due to SNPs, a
simple calculation of the SNP frequency in U snRNA would be ~1×10-3,
which is in the range of the SNP frequency in the human genome (~1×10-3 to
3×10-3). Thus, we conclude that a large amount of heterogeneity in the deep
sequencing dataset is caused by SNPs and that the origin of the remaining
32
part of heterogeneity in our deep sequencing reads cannot be unambiguously
clarified.
Table 3. Summary of the identified spliceosomal snRNA variants and their genomic
loci.
Human
Chimpanzee
Rhesus
U1
U2
U4
U5
U6
U11
U12
U4atac
U1
U2
U4
U5
U6
U11
U12
U4atac
U1
U2
U4
U5
U6
U11
U12
U4atac
Nr. of variants
Nr. of loci
7
3
5
7
1
1
1
1
4
2
2
3
1
1
1
1
5
1
2
1
0
0
1
0
14
3
5
7
7
1
1
1
6
2
2
3
5
1
1
1
14
1
2
1
0
0
1
0
Taken together, our spliceosomal snRNA profiling study suggests that most
of the U snRNA gene loci identified by the bioinformatics approach are most
likely pseudogenes. These pseudogenes are either expressed at a very low
level or are silent and only a small number of potential U snRNA loci are
expressed. The heterogeneity in the U snRNA population seems to largely
depend on the presence of the SNPs at their gene coding loci.
33
CONCLUDING REMARKS
We have used three different approaches to study the population of spliceosomal snRNAs in primates, in particularly humans. This is the first time that
the heterogeneity of well-known spliceosomal snRNA populations has been
analyzed to such depth.
Each of the approaches has contributed to our understanding of the variety and functionality of spliceosomal snRNAs. The results of the computational approach indicated that the U snRNA genes have probably experienced frequent gene duplication and retrotransposition events, resulting in
numerous homologous loci throughout the genome. The evidence for the
expression of these homologous genes was provided by the biochemical
approach. This approach identified three U1-like snRNAs that play a role in
regulation of alternative splicing by virtue of their potential to recognize
non-canonical splicing sites. Finally, the expression profile of the U snRNA
genes was obtained from the high-throughput sequencing approach. It revealed that the vast majority of U snRNAs originated from the few bona fide
gene loci, and that a significant degree of heterogeneity in the U snRNA
populations was largely due to the presence of SNPs at these loci. The
homologs, transcriptional noise, and post-transcriptional modifications could
also generate variations in the U snRNA populations. However, the frequency of these variants was much lower.
“Noise” in biological systems has been referred to as variations between
populations or individuals, as well as non-functional components such as
“junk” DNA. The biological significance of such has been an on-going debate in the field. Our study of spliceosomal snRNA populations has touched
on many aspects regarding the noise in biological systems. One suggestion is
that frequent gene duplication and retrotransposition could contribute to the
formation of “junk” DNA regions. Despite their abundance, these homologs
are rarely expressed nor do they have functionalities in the cell; in most cases they are just pseudogenes. The heterogeneity of RNA, or at least that of U
snRNAs, mainly originates from genetic variation, especially from the presence of single nucleotide polymorphism, which seems to be a major source
for the diversity of the U snRNA populations.
Recent observations have revealed that ~10% of all inheritable disorders
are associated with mutations at splice junctions (Wang and Cooper 2007).
Furthermore, several mutations of spliceosome machinery components associated with human diseases have been identified, once again emphasizing the
34
connection between malfunction of the splicing machinery and human disease (Cooper et al. 2009; Edery et al. 2011). Taken together with our results,
the SNPs at the bona fide loci encoding spliceosomal snRNAs might very
well be linked to some inheritable diseases and probably can be selected as
biomarkers in the future for diagnosis.
The high SNP frequency we detected at U snRNA gene loci certainly
leads to high frequency variations in the RNA population. It may explain
why proteins evolved within cells to replace the functionalities of RNA in
the ribozymes. The evolution of the catalytic RNase P ribozyme is a good
example of this. In RNaseP the number of proteins that are required for the
ribozyme to be functional correlates with the location of the species in the
evolutionary tree. Thus, the RNase P of bacteria only has one protein and
one RNA, while human RNase P is composed of ten proteins and one RNA
(Lai et al. 2010).
Even though we have shown in our study that most of the homologs of U
snRNAs are not expressed and presumably have no functionality, they can
definitely not be regarded as just “junk” in the genome. They play important
roles in expanding both the genome and the transcriptome by providing a
platform where new genes can derive from pre-existing genes without disturbing existing genes and their functions.
My thesis ends here, but the story of spliceosomal snRNAs will continue.
With the completion of “the 1000 genomes project" (1000 Genomes Project
Consortium. 2010) and development of sequencing technology, I believe we
will be able to draw a better picture regarding the frequency of SNPs in the
spliceosomal snRNAs, as well as their biological significance.
35
SVENSK SAMMANFATTNING
Upptäckten att antalet gener inte skiljer sig nämnvärt mellan eukaryota organismer är en av de största överraskningarna de senaste åren. Mindre komplexa organismer, t.ex. de ryggradslösa djuren nematod och bananfluga, har
ungefär lika många proteinkodande gener som betydligt mer utvecklade
ryggradsdjur, t.ex. människan. Antalet proteinkodande arvsanlag kan därför
inte vara den enda förklaringen till att organismers utvecklingsgrad och
komplexitet skiljer sig. Det har därför föreslagits att komplexiteten hos en
organism till stor del beror på komplexiteten hos proteinerna snarare än antalet arvsanlag som kodar för proteiner. En annan stor överraskning under
senare år är upptäckten att RNA molekyler har viktiga roller förutom sin roll
som informationslänk. Det är nu uppenbart att RNA fyller flera
livsavgörande funktioner i en cell, som strukturell, funktionell, reglerande
och informationsbärande molekyl. Både protein kodande och icke kodande
RNA molekyler fyller viktiga roller för att reglera proteinsyntesen och
påverkar därför komplexiteten hos proteinmönstret i en cell.
I eukaryota celler produceras proteiner via avläsning av budbärar RNA
(mRNA) i cytoplasman i en process som kallas för translation. Det mRNA
som avläses har ursprungligen producerats i cellens kärna, via en process
som kallas transkription. Den transkriberade mRNA molekylen måste dock
genomgå en serie mognadsprocesser och dessutom transporteras från kärnan
till cytoplasman innan den kan användas för proteinsyntes. En viktig
mognadsprocess är s.k. splitsning. Vid splitsning omstruktureras den ursprungligt transkriberade mRNA molekylen. Stora delar av RNA molekylen,
som kallas introner, plockas bort och de kvarvarande delarna, som kallas
exoner, kopplas ihop för att tillsammans utgöra den mogna mRNA
molekylen. Det är mycket vanligt att mRNA från en och samma gen splitsas
på olika sätt. Det kallas för alternativ splitsning och innebär att exoner kopplas ihop på fler än ett sätt. Ett och samma ursprungligt transkriberat mRNA
kan därför ge upphov till flera mogna mRNA som i sin tur ger upphov till
olika proteiner. Flera proteiner än ett kan därför produceras från ett och
samma ursprungligt transkriberat mRNA. Det har visat sig att alternativ
splitsning är betydligt vanligare hos ryggradsdjur än hos de mindre komplexa ryggradslösa djuren, nematod och bananfluga. Alternativ splitsning
antas därför ha en stor betydelse för att öka komplexiteten hos proteinernas
sammansättning hos en djurart.
36
Splitsningsprocessen kräver ett stort antal komponenter. Det behövs ett
femtiotal proteiner och åtminstone fem olika små icke kodande RNA
molekyler (snRNA). De vanligaste benämns U1, U2, U4, U5 och U6 snRNA. De fem U snRNA molekylerna bildar tillsammans med femtiotalet proteiner själva splitningsmaskineriet som kallas för splitsosomen. I
splitsosomen har U snRNA två viktiga funktioner, dels känner de igen
ställen på den ursprungliga mRNA molekylen som skall splitsas, dvs. klyvas
och kopplas ihop och dels deltar de som katalysatorer av de kemiska reaktioner som måste ske. I samband med att människans DNA sekvens klarlades blev det uppenbart att det fanns ett stort antal potentiella gener som kodade för U snRNA molekylerna. Många av dessa U snRNA gener inom en
och samma familj, t. ex. de som kodar för U1 snRNA, skiljer sig avsevärt åt
från varandra medan andra endast skiljer sig åt på några få ställen. Detta
tyder på att det kan finnas en stor heterogenitet inom en och samma U snRNA population. Heterogeniteten hos U snRNA skulle därmed kunna ge en
förklaring till hur proteomets komplexitet ökar eftersom små variationer i U
snRNA sekvenserna kan leda till att splitningsprocessen påverkas och då
speciellt med hur effektivt olika kombinationer av exoner sammanfogas.
Med andra ord skulle en stor heterogenitet inom U snRNA populationen
kunna vara en bakomliggande mekanistisk orsak till den stora förekomsten
av alternativ splitsning hos ryggradsdjuren. I mitt avhandlingsarbete har jag
därför studerat sammansättningen av U snRNA populationerna hos människa samt jämfört människans mönster med motsvarande mönster hos schimpans och makaker. Vi valde dessa tre ryggradsdjur eftersom vi redan i vår
första studie kunde påvisa att U snRNA gener evolverar mycket snabbt, så
snabbt att det redan finns skillnader mellan människan och dess evolutionärt
närmaste släktingar.
Jag använde mig av tre olika experimentella tillvägagångssätt, bioinformatik, biokemisk karaktärisering och storskalig sekvensering av RNA
molekyler. Sammantaget visade mina studier att det finns en stor grad av
heterogenitet inom respektive U snRNA population. Överraskande var dock
att denna heterogenitet huvudsakligen var kopplad till en skillnad mellan
individer än till vilket organ eller organism U snRNA molekylen kom från.
Vi observerade dessutom att skillnaderna när de väl uppträdde var mycket
små, dvs. att endast någon enstaka nukleotid i respektive U snRNA gen hade
förändrats. Skillnaderna var därmed så små att de med största sannolikhet
inte kommer att påverka U snRNA molekylens funktion. Tre sällsynta undantag kunde vi dock hitta. I de fallen handlade det om tre varianter av U1
snRNA hos människa. Alla dessa tre fall berodde på förändringar i en region
som känner igen själva splitsningsstället och förändringarna innebar att
ställen som normalt inte skulle kännas igen som splitsningsställen skulle
kunna kännas igen av de tre varianterna. Förekomsten av dessa tre varianter
är dock så liten att enbart deras förekomst inte på långa vägar kan förklara
37
den höga graden av alternativ splitsning som finns hos ryggradsdjur,
inklusive primaterna.
Sammantaget har avhandlingsarbetet visat att förhållandevis få av det
stora antalet potentiella U snRNA gener ger upp till huvuddelen av de U
snRNA molekyler som finns i en eukaryot cell. Mitt arbete visade också att
det i grund och botten endast är små skillnader mellan U snRNA populationernas sammansättning hos olika primater. Några skillnader finns men ytterligare studier krävs för att klarlägga om och hur de påverkar egenskaper hos
de tre primater jag jämfört. De skillnader jag fann var huvudsakligen relaterade till de små skillnader i DNA sekvensen som finns när man jämför två
olika individer. Framtida studier krävs därför även här för att klarlägga om
dessa små skillnader kan förklara varför vi som individer trots allt skiljer oss
åt, om än skillnaderna oss emellan är små. Det är dock inte en omöjlighet att
sådana små förändringar kan ha en påverkan eftersom man redan idag har
funnit individer som pga. mycket små förändringar i sina U snRNA gener
drabbas av genetiskt nedärvda allvarliga utvecklingsstörningar och sjukdomar.
Översatt från engelska av Anders Virtanen
38
ACKNOWLEDGEMENTS
I express my sincere gratitude to all the people who have supported me in
various ways during these years. This thesis would not have been possible
without them! In particular:
My supervisor Anders Virtanen, for accepting me as a PhD student, for
introducing me to the “RNA world”, for sharing your knowledge and ideas,
for being enthusiastic and optimistic in science and for your encouragement
and challenges. The discussions with you are always fruitful and inspiring.
You are truly a great teacher!
My co-supervisor Leif Kirsebom, for sharing your thoughts and for all the
discussions.
My collaborator Ed Green, for bringing me to the MPI, for being supportive
and helpful, no matter how busy your are, and for being such a kind person.
It was a wonderful experience for me to work in MPI and I really enjoyed
the time when I was there.
My poly(A) girls and boys, for making work such a pleasant place to be at,
in particular:
Christina, for introducing me to the project and lab when I just arrived, for
showing me what is the right attitude in research. Jens, for all the supports
and help both in and outside the lab throughout the years. You have told me
a lot of things, especially, how to live busy and happy. Together with Babett
and kids, for the countless cozy and happy times! You are my favorite family! Pontus, my brother in “U snRNA”, for sharing the ideas and solving my
computer problems, for understanding my complaints and frustrations regarding data analysis. Per, for always being calm and patient, for teaching
me “tricks” in the lab and most important for the friendship! Niklas, for
being such a good “lab manager” and making the lab a nice place to be at,
for teaching me how to teach and for thoroughly proofreading my thesis.
Harri, for helping me with experiments when I was away, for lending interesting books and music to me; Johan, for being my first project student and
for teaching me my first shell program; Micke, for all the interesting discussions and chats, for reminding me to relax now and then; Mattias and Magnus, for the nice company.
39
The former members from the “MCB” program for the nice atmosphere and
long-term friendships, in particular:
Nicole, for listening to my problems, for always being cheerful and encouraging, together with Christoph, for all the fun things we did together, we
need to build that big Carcassonne map again☺ Hope to see you soon! Agata, together with Grzegorz and Sonia, for all the fun parties and interesting
discussions; Jia, for all the help, especially when I had just come to Sweden.
Itaxso for all the nice times and chats.
My collaborators at the Max Planck Institute for Evolutionary Anthropology,
in particular:
Svante, for supporting this project; Matthias, for advice and running all the
q-PCRs, nice discussions and poker games.
All present and former colleagues at ICM, in particular:
The administrators—Sigrid, Åsa, Ewa and Anders for their support and
assistance; Akiko, for all the help and nice conversations.
My dearest friends, in particular:
The GIRLs: Anna, Irene and Ida, for all the fun, memories, chats and understanding. I miss you all!
The families: Fredbys; Halls; Natalia, Andrey and Albert; Gelmine, Larry and Vincent for sharing the laughs and joys.
In Sweden: Xiaomin, Shiying, Ning, Yanlin and Hanna, for cheering me
up.
In China: Yi and Zhijia, for more than 25 years of friendship.
Dr. Jin-ping Li, for helping me to seek study in Sweden and for backing me
up since then.
Prof. Aihua Liang, for accepting me to do the bachelor thesis project in
your lab, where my research life began.
My Conze family, especially Pappi and Mammi, for always welcoming me
with a warm heart.
My 爸 爸 (baba) and 妈 妈 (mama), for your unconditional love and trust; for
always believing in me!
My precious Louis, for changing me into a strong and efficient person. Life
would be so boring without you!
My beloved hubby Tim, for walking through the dark time with me and
celebrate in the bright; for your encouragement, support and even criticism ♥
40
REFERENCES
1000 Genomes Project Consortium. 2010. A map of human genome variation from
population-scale sequencing. Nature 467(7319): 1061-1073.
ENCODE Project Consortium 2004. The ENCODE (ENCyclopedia Of DNA
Elements) Project. Science 306(5696): 636-640.
International Human Genome Sequencing Consortium 2004. Finishing the
euchromatic sequence of the human genome. Nature 431(7011): 931-945.
Avery OT, Macleod CM, McCarty M. 1944. Studies on the Chemical Nature of the
Substance Inducing Transformation of Pneumococcal Types : Induction of
Transformation by a Desoxyribonucleic Acid Fraction Isolated from
Pneumococcus Type Iii. The Journal of experimental medicine 79(2): 137-158.
Bartel DP. 2009. MicroRNAs: target recognition and regulatory functions. Cell
136(2): 215-233.
Beadle GW, Tatum EL. 1941. Genetic Control of Biochemical Reactions in
Neurospora. Proceedings of the National Academy of Sciences of the United
States of America 27(11): 499-506.
Berget SM, Moore C, Sharp PA. 1977. Spliced segments at the 5' terminus of
adenovirus 2 late mRNA. Proceedings of the National Academy of Sciences of
the United States of America 74(8): 3171-3175.
Boveri T. 1904. Ergebnisse über die Konstitution der chromatischen. Substanz des
Zelkerns, Fisher, Jena.
Burge CB, Padgett RA, Sharp PA. 1998. Evolutionary fates and origins of U12-type
introns. Molecular cell 2(6): 773-785.
Chow LT, Gelinas RE, Broker TR, Roberts RJ. 1977. An amazing sequence
arrangement at the 5' ends of adenovirus 2 messenger RNA. Cell 12(1): 1-8.
Cooper TA, Wan L, Dreyfuss G. 2009. RNA and disease. Cell 136(4): 777-793.
Crick F. 1970. Central dogma of molecular biology. Nature 227(5258): 561-563.
Crick FH. 1958. The biological replication of macromolecules. Symposia of the
Society for Experimental Biology XII(138).
Doolittle RF. 1986. Of urfs and orfs: a primer on how to analyze derived amino acid
sequences. University Science Books, Mill Valley, CA.
Edery P, Marcaillou C, Sahbatou M, Labalme A, Chastang J, Touraine R, Tubacher
E, Senni F, Bober MB, Nampoothiri S et al. 2011. Association of TALS
developmental disorder with defect in minor splicing component U4atac snRNA.
Science 332(6026): 240-243.
Frith MC, Pheasant M, Mattick JS. 2005. The amazing complexity of the human
transcriptome. European journal of human genetics : EJHG 13(8): 894-897.
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O,
Zhang ZD, Weissman S, Snyder M. 2007. What is a gene, post-ENCODE?
History and updated definition. Genome research 17(6): 669-681.
Griffith F. 1928. The Significance of Pneumococcal Types. The Journal of hygiene
27(2): 113-159.
41
Hamilton AJ, Baulcombe DC. 1999. A species of small antisense RNA in
posttranscriptional gene silencing in plants. Science 286(5441): 950-952.
Hartmann L, Neveling K, Borkens S, Schneider H, Freund M, Grassman E, Theiss S,
Wawer A, Burdach S, Auerbach AD et al. 2010. Correct mRNA processing at a
mutant TT splice donor in FANCC ameliorates the clinical phenotype in
patients and is enhanced by delivery of suppressor U1 snRNAs. American
journal of human genetics 87(4): 480-493.
Hernandez N, Weiner AM. 1986. Formation of the 3' end of U1 snRNA requires
compatible snRNA promoter elements. Cell 47(2): 249-258.
Hershey AD, Chase M. 1952. Independent functions of viral protein and nucleic acid
in growth of bacteriophage. The Journal of general physiology 36(1): 39-56.
Hodnett JL, Busch H. 1968. Isolation and characterization of uridylic acid-rich 7 S
ribonucleic acid of rat liver nuclei. The Journal of biological chemistry 243(24):
6334-6342.
Jacob F, Perrin D, Sanchez C, Monod J. 1960. [Operon: a group of genes with the
expression coordinated by an operator]. Comptes rendus hebdomadaires des
seances de l'Academie des sciences 250: 1727-1729.
Johannsen W. 1909. Elemente der exakten Erblichkeitslehre. Gustav Fischer, Jena.
Jurica MS, Moore MJ. 2003. Pre-mRNA splicing: awash in a sea of proteins.
Molecular cell 12(1): 5-14.
Kent WJ. 2002. BLAT--the BLAST-like alignment tool. Genome research 12(4):
656-664.
Keren H, Lev-Maor G, Ast G. 2010. Alternative splicing and evolution:
diversification, exon definition and function. Nature reviews Genetics 11(5):
345-355.
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N,
Teague B, Alkan C, Antonacci F et al. 2008. Mapping and sequencing of
structural variation from eight human genomes. Nature 453(7191): 56-64.
Konarska MM. 1998. Recognition of the 5' splice site by the spliceosome. Acta
biochimica Polonica 45(4): 869-881.
Lai LB, Vioque A, Kirsebom LA, Gopalan V. 2010. Unexpected diversity of RNase
P, an ancient tRNA processing enzyme: challenges and prospects. FEBS letters
584(2): 287-296.
Lander ES Linton LM Birren B Nusbaum C Zody MC Baldwin J Devon K Dewar K
Doyle M FitzHugh W et al. 2001. Initial sequencing and analysis of the human
genome. Nature 409(6822): 860-921.
Lasda EL, Blumenthal T. 2011. Trans-splicing. Wiley interdisciplinary reviews RNA
2(3): 417-434.
Lerner MR, Steitz JA. 1979. Antibodies to small nuclear RNAs complexed with
proteins are produced by patients with systemic lupus erythematosus.
Proceedings of the National Academy of Sciences of the United States of
America 76(11): 5495-5499.
Lodish H, Berk A, Zipursky SL, Matsudaira P, Baltimore D, Darnell. J. 2000.
Molecular Cell Biology, 4th edition. W.H.Freeman, New York.
Lu C, Shedge V. 2011. Construction of small RNA cDNA libraries for highthroughput sequencing. Methods Mol Biol 729: 141-152.
Lund E. 1988. Heterogeneity of human U1 snRNAs. Nucleic acids research 16(13):
5813-5826.
Manser T, Gesteland RF. 1982. Human U1 loci: genes for human U1 RNA have
dramatically similar genomic environments. Cell 29(1): 257-264.
Marz M, Kirsten T, Stadler PF. 2008. Evolution of spliceosomal snRNA genes in
metazoan animals. Journal of molecular evolution 67(6): 594-607.
42
Mattick JS. 2001. Non-coding RNAs: the architects of eukaryotic complexity.
EMBO reports 2(11): 986-991.
Mattick JS, Makunin IV. 2006. Non-coding RNA. Human molecular genetics 15
Spec No 1: R17-29.
Mendel GJ. 1866. Versuche über Pflanzenhybriden. Verhandlungen des
naturforschenden Vereines in Brünn, Bd. IV für das Jahr, 1865
Abhandlungen:3–47. For the English translation, see: Druery, C.T and William
Bateson (1901). "Experiments in plant hybridization". Journal of the Royal
Horticultural Society 26: 1–32. Retrieved 2009-10-09.
Metzker ML. 2010. Sequencing technologies - the next generation. Nature reviews
Genetics 11(1): 31-46.
Min Jou W, Haegeman G, Ysebaert M, Fiers W. 1972. Nucleotide sequence of the
gene coding for the bacteriophage MS2 coat protein. Nature 237(5350): 82-88.
Monstein HJ, Hammarstrom K, Westin G, Zabielski J, Philipson L, Pettersson U.
1983. Loci for human U1 RNA: structural and evolutionary implications.
Journal of molecular biology 167(2): 245-257.
Morgan TH. 1910. Sex Limited Inheritance in Drosophila. Science 32(812): 120-122.
Morris KV. 2009. Non-coding RNAs, epigenetic memory and the passage of
information to progeny. RNA biology 6(3): 242-247.
Mount SM, Pettersson I, Hinterberger M, Karmas A, Steitz JA. 1983. The U1 small
nuclear RNA-protein complex selectively binds a 5' splice site in vitro. Cell
33(2): 509-518.
Muller HJ. 1927. Artificial Transmutation of the Gene. Science 66(1699): 84-87.
Murphy JT, Skuzeski JT, Lund E, Steinberg TH, Burgess RR, Dahlberg JE. 1987.
Functional elements of the human U1 RNA promoter. Identification of five
separate regions required for efficient transcription and template competition.
The Journal of biological chemistry 262(4): 1795-1803.
Nirenberg MW, Matthaei JH. 1961. The dependence of cell-free protein synthesis in
E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings
of the National Academy of Sciences of the United States of America 47: 15881602.
Ohno S. 1972. So much "junk" DNA in our genome. Brookhaven symposia in
biology 23: 366-370.
Ozsolak F, Milos PM. 2011. RNA sequencing: advances, challenges and
opportunities. Nature reviews Genetics 12(2): 87-98.
Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. 2008. Deep surveying of alternative
splicing complexity in the human transcriptome by high-throughput sequencing.
Nature genetics 40(12): 1413-1415.
Patel AA, Steitz JA. 2003. Splicing double: insights from the second spliceosome.
Nature reviews Molecular cell biology 4(12): 960-970.
Patel SB, Bellini M. 2008. The assembly of a spliceosomal small nuclear
ribonucleoprotein particle. Nucleic acids research 36(20): 6482-6493.
Pheasant M, Mattick JS. 2007. Raising the estimate of functional human sequences.
Genome research 17(9): 1245-1253.
Pink RC, Wicks K, Caley DP, Punch EK, Jacobs L, Carter DR. 2011. Pseudogenes:
pseudo-functional or key regulators in health and disease? RNA 17(5): 792-798.
Ponicsan SL, Kugel JF, Goodrich JA. 2010. Genomic gems: SINE RNAs regulate
mRNA production. Current opinion in genetics & development 20(2): 149-155.
Portela A, Esteller M. 2010. Epigenetic modifications and human disease. Nature
biotechnology 28(10): 1057-1068.
Ramsey S, Ozinsky A, Clark A, Smith KD, de Atauri P, Thorsson V, Orrell D,
Bolouri H. 2006. Transcriptional noise and cellular heterogeneity in mammalian
43
macrophages. Philosophical transactions of the Royal Society of London Series
B, Biological sciences 361(1467): 495-506.
Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating
inhibitors. Proceedings of the National Academy of Sciences of the United
States of America 74(12): 5463-5467.
Schmutz J, Wheeler J, Grimwood J, Dickson M, Yang J, Caoile C, Bajorek E, Black
S, Chan YM, Denys M et al. 2004. Quality assessment of the human genome
sequence. Nature 429(6990): 365-368.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H,
Walker M, Chi M et al. 2004. Large-scale copy number polymorphism in the
human genome. Science 305(5683): 525-528.
Shimba S, Reddy R. 1994. Purification of human U6 small nuclear RNA capping
enzyme. Evidence for a common capping enzyme for gamma-monomethylcapped small RNAs. The Journal of biological chemistry 269(17): 12419-12423.
Sie CP, Kuchka M. 2011. RNA editing adds flavor to complexity. Biochemistry
Biokhimiia 76(8): 869-881.
Smit, AFA, Hubley R, Green P. 1996-2010. RepeatMasker Open-3.0.
http://www.repeatmasker.org.
Sontheimer EJ, Steitz JA. 1992. Three novel functional variants of human U5 small
nuclear RNA. Molecular and cellular biology 12(2): 734-746.
Spilianakis CG, Lalioti MD, Town T, Lee GR, Flavell RA. 2005. Interchromosomal
associations between alternatively expressed loci. Nature 435(7042): 637-645.
Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke
L, Clee C, Coghlan A et al. 2003. The genome sequence of Caenorhabditis
briggsae: a platform for comparative genomics. PLoS biology 1(2): E45.
Sutton WS. 1903. The chromosomes in heredity. Biological Bulletin 4: 231-251.
Taft RJ, Glazov EA, Cloonan N, Simons C, Stephen S, Faulkner GJ, Lassmann T,
Forrest AR, Grimmond SM, Schroder K et al. 2009. Tiny RNAs associated with
transcription start sites in animals. Nature genetics 41(5): 572-578.
Tarn WY, Steitz JA. 1997. Pre-mRNA splicing: the discovery of a new spliceosome
doubles the challenge. Trends in biochemical sciences 22(4): 132-137.
Turunen J. 2012. The function and regulation of the U11-48K protein in U12dependent splicing. Helsinki Univerisity, Helsinki.
Tuschl T, Sharp PA, Bartel DP. 2001. A ribozyme selected from variants of U6
snRNA promotes 2',5'-branch formation. RNA 7(1): 29-43.
Valadkhan S, Manley JL. 2001. Splicing-related catalysis by protein-free snRNAs.
Nature 413(6857): 701-707.
Valadkhan S, Mohammadi A, Wachtel C, Manley JL. 2007. Protein-free
spliceosomal snRNAs catalyze a reaction that resembles the first step of splicing.
RNA 13(12): 2300-2311.
Vanin EF, Goldberg GI, Tucker PW, Smithies O. 1980. A mouse alpha-globinrelated pseudogene lacking intervening sequences. Nature 286(5770): 222-226.
Villa-Komaroff L, Guttman N, Baltimore D, Lodishi HF. 1975. Complete
translation of poliovirus RNA in a eukaryotic cell-free system. Proceedings of
the National Academy of Sciences of the United States of America 72(10): 41574161.
Wain HM, Bruford EA, Lovering RC, Lush MJ, Wright MW, Povey S. 2002.
Guidelines for human gene nomenclature. Genomics 79(4): 464-470.
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF,
Schroth GP, Burge CB. 2008. Alternative isoform regulation in human tissue
transcriptomes. Nature 456(7221): 470-476.
44
Wang GS, Cooper TA. 2007. Splicing in disease: disruption of the splicing code and
the decoding machinery. Nature reviews Genetics 8(10): 749-761.
Wang KC, Chang HY. 2011. Molecular mechanisms of long noncoding RNAs.
Molecular cell 43(6): 904-914.
Waterston RH Lindblad-Toh K Birney E Rogers J Abril JF Agarwal P Agarwala R
Ainscough R Alexandersson M An P et al. 2002. Initial sequencing and
comparative analysis of the mouse genome. Nature 420(6915): 520-562.
Watson JD, Crick FH. 1953. The structure of DNA. Cold Spring Harbor symposia
on quantitative biology 18: 123-131.
Weinberg R, Penman S. 1969. Metabolism of small molecular weight monodisperse
nuclear RNA. Biochimica et biophysica acta 190(1): 10-29.
Weinberg RA, Penman S. 1968. Small molecular weight monodisperse nuclear
RNA. Journal of molecular biology 38(3): 289-304.
Westin G, Zabielski J, Hammarstrom K, Monstein HJ, Bark C, Pettersson U. 1984.
Clustered genes for human U2 RNA. Proceedings of the National Academy of
Sciences of the United States of America 81(12): 3811-3815.
Whitehead J, Pandey GK, Kanduri C. 2009. Regulation of the mammalian
epigenome by long noncoding RNAs. Biochimica et biophysica acta 1790(9):
936-947.
Will CL, Luhrmann R. 2001. Spliceosomal UsnRNP biogenesis, structure and
function. Current opinion in cell biology 13(3): 290-301.
Will CL, Lührmann R. 2011. Spliceosome structure and function. Cold Spring
Harbor Perspectives in Biology 3(7).
Willingham AT, Gingeras TR. 2006. TUF love for "junk" DNA. Cell 125(7): 12151220.
Wold F. 1981. In vivo chemical modification of proteins (post-translational
modification). Annual review of biochemistry 50: 783-814.
Yu Y-T, Scharl EC, Smith CM, Steitz JA. 1999. The Growing World of Small
Nuclear Ribonucleoproteins. In The RNA World, 2nd Ed, pp. 487-524.
Zhuang Y, Weiner AM. 1986. A compensatory base change in U1 snRNA
suppresses a 5' splice site mutation. Cell 46(6): 827-835.
45