Genomic programming of complex organisms:
the hidden layer of noncoding RNA
Numbers of protein-coding genes do not scale strongly with complexity:
- fly and worm protein-coding gene numbers (14-19,000) are only about 2-3 times those of yeast (6,000) and P. aeruginosa (5,500)
- mammalian (human, mouse) protein-coding gene numbers (~30,000) are only about twice those of invertebrates (and less than some plants).
The genetic basis of eukaryotic complexity and variation
The relative amount of noncoding DNA does scale with complexity
[Figure: noncoding DNA relative to genomic DNA, by group: prokaryotes, simpler eukaryotes, complex fungi, plants, invertebrates, Ciona (chordate), vertebrates]
Either the human genome is replete with useless transcription or these non-protein-coding RNAs are fulfilling some unexpected function
~ 98% of the transcriptional output in humans is noncoding RNA
- around 95-97% of the primary transcript of protein coding genes is intronic
- there are enormous numbers of noncoding RNA genes in the mammalian genome, which are only now beginning to be recognised, and which appear to account for between 1/2 and 3/4 of all transcripts
At least 50% and possibly the majority of the human genome is transcribed
- 1.5% protein coding indicates that a minimum of 30% of the genome is transcribed (x20)
- if equal number of noncoding RNA transcripts, then over 60% is transcribed
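The arithmetic behind these two estimates can be sketched as follows. This is a minimal illustration using only the round figures quoted above; the function and variable names are hypothetical, and the calculation assumes that transcripts do not overlap and that noncoding RNA transcripts are similar in size to protein-coding ones.

```python
# Back-of-envelope estimate of the transcribed fraction of the genome.
# Inputs are the round figures quoted on the slide; names are illustrative only.

def transcribed_fraction(coding_fraction, exon_share_of_transcript, ncrna_to_mrna_ratio=0.0):
    """Fraction of the genome transcribed, assuming non-overlapping transcripts."""
    # Primary transcripts of protein-coding genes cover exons plus introns.
    protein_gene_fraction = coding_fraction / exon_share_of_transcript
    # Noncoding RNA genes are assumed to have transcripts of similar size.
    return protein_gene_fraction * (1.0 + ncrna_to_mrna_ratio)

coding = 0.015      # 1.5% of the genome is protein coding
exon_share = 0.05   # ~5% of the primary transcript is exonic, i.e. the "x20" factor

minimum = transcribed_fraction(coding, exon_share)
with_ncrna = transcribed_fraction(coding, exon_share, ncrna_to_mrna_ratio=1.0)

print(f"protein-coding genes only: {minimum:.0%}")
print(f"with an equal number of ncRNA transcripts: {with_ncrna:.0%}")
```

Taking the intronic share as 97% rather than 95% would raise both figures, which is one reading of "over 60%".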
The genetic basis of human complexity and variation
The central dogma states that genetic information flows from DNA to RNA to protein
This is usually interpreted to mean that genetic information flows from DNA to protein (via mRNA), i.e. genes are generally synonymous with proteins
This is true in prokaryotes, whose genomes consist largely (75-95%) of wall-to-wall protein-coding sequences flanked by 5’ and 3’ regulatory elements.
It has been assumed that the same applies in eukaryotes, with the logical extension (i) that the increased complexity of eukaryotes is explained by the combinatorics of regulatory factors intersecting with more complex promoters etc., and the corollary (ii) that non-protein-coding sequences in eukaryotic genomes (98.5% in humans) are either cis-regulatory elements or evolutionary junk.
Thus proteins comprise not only the functional and structural components of cells, but are also the agents by which the system is regulated in conjunction with cis-elements and environmental signals.
These assumptions are now articles of faith, but they are not necessarily correct.
The assumption that genes are synonymous with proteins (and that proteins are all that is required to program a complex organism) has led to all sorts of subsidiary assumptions, notably that introns (because they do not encode protein and despite the fact that they are transcribed), do not transmit genetic information into the system (as RNA molecules).
If one considers the reasonable alternative, i.e. if intronic RNA is functional, then there are a number of logical extensions:
1. Genetic information is being expressed both as RNA and as proteins, and intronic RNA is transmitting secondary information in parallel with protein-coding sequences, which must be involved in networking of gene activity. Thus the genetic operating system is different between eukaryotes and prokaryotes.
2. Some, perhaps many genes, will have evolved only to express RNA signals.
3. These RNAs must be processed post-splicing and be transmitting information via RNA-DNA, RNA-RNA and RNA-protein interactions, presumably sequence-specifically. This equates to a quasi-digital feed-forward regulatory system that would, in theory, permit integration of complex suites of gene activity and regulatory regimes, and crucially the programming of gene expression profiles throughout the trajectories of differentiation and development.
1. The majority of the genomic sequence in the higher organisms (the non-protein-coding DNA) is devoted to the control of developmental programming.
2. The majority of the regulatory transactions during development in the higher organisms are conveyed by RNAs, not proteins, although the two classes of regulatory controls work in concert.
3. The combinatorics of protein regulators intersecting with environmental signals in itself provides insufficient state information for the programming of differentiation and development. Rather the importance of protein signaling is to provide contextual cues to guide and to tune the (RNA-directed) endogenously programmed pathways by providing positional information and correcting stochastic errors.
4. The complexity of prokaryotes has been limited throughout evolution not by biochemical or environmental factors, or cell structure, but by a primitive regulatory system based on proteins alone.
[Diagram: prokaryotic versus eukaryotic gene output]
Prokaryotic gene -> mRNA -> protein (catalytic function, structural role, regulation): SINGLE OUTPUT, SIMPLE OPERATING SYSTEM
Eukaryotic gene -> mRNA and/or eRNA -> protein (catalytic function, structural role, regulation) plus networking functions (hidden layer): MULTIPLEX OUTPUT, PARALLEL PROCESSING
• 98% of all transcription in humans is ncRNA, and over half of the human genome is transcribed
• There are 20,000 pseudogenes in the human genome, many of which are transcribed, and at least one of which has a function (as an RNA, regulating the expression of its protein-coding homolog).
• There are huge blocks of conservation in introns and intergenic sequences
• At least 60% of all human genes have associated antisense transcripts - mechanism of imprinting, transvection and inter-allelic communication? Other?
• Introns and ncRNAs are processed to microRNAs
• All well studied gene loci in animals have a majority of noncoding transcripts (callipyge, IGFR, Gnas, globin, bithorax etc.).
• At least 50% of mouse cDNAs do not appear to encode protein, and whole transcript mapping of human chromosomes shows an order of magnitude higher numbers of transcripts than expected from protein-coding sequences
• A wide range of complex genetic phenomena in eukaryotes (RNAi, co-suppression, transvection, transinduction) are directed by RNA.
Aparicio et al. Science 297, 1301-1310 (2002)
Not all introns have evolved function, as each intron is the descendant of an independent insertion event and is evolving independently, albeit in the context of its host transcript, as exemplified by the intron distribution in Fugu rubripes.
Multiple microRNAs (miRs) are produced by processing of “intergenic” noncoding RNAs and introns
Lagos-Quintana et al. Science 294, 853-858 (2001)
Lau et al. Science 294, 858-862 (2001)
Lee and Ambros Science 294, 862-864 (2001)
“Glimpses of a tiny RNA world” Ruvkun Science 294, 797-799 (2001)
Imprinted micro RNA genes transcribed antisense to a reciprocally imprinted retrotransposon-like gene. Seitz et al. (2003) Nature Genetics 34, 261-262.
Functions of trans-acting RNA signals
- Control of alternative splicing
- Epigenetic modification
- Transcriptional regulation
- Control of mRNA turnover
- Control of translation
- Signal transduction
- RNA modification and editing
Revised definition of gene and flow of genetic information
[Diagram: gene (transcription unit / cluster) -> transcription -> primary transcript (exons + introns) -> splicing -> mRNA or ncRNA -> protein (catalytic functions, structural roles, signal transduction and regulation of gene expression) and other functions; processing -> snoRNAs, microRNAs -> assembly -> networking]
Rnomics
- Definition of transcriptome and the dynamic expression of ncRNA genes in differentiation and development: mammalian ncRNA database and development of ncRNA chips.
- Computational analysis of homology patterns in noncoding sequences within and between genomes: evolutionary constraints, network wiring, mechanistic clues, selection of candidates for molecular genetic analysis.
- Genetic and molecular genetic analyses in model organisms (yeast, C. elegans, Drosophila and vertebrates): proof-of-principle experiments to show that introns are transmitting genetic signals, genetic and biochemical resource development to test mechanism.
- Modelling of eRNA genetic networks. Proof that regulatory networks are accelerating networks and that prokaryotes have been limited by regulatory overhead.
The numbers of regulators must rise as a non-linear (quadratic) function of the number of regulons (genes or co-regulated modules of genes) - in prokaryotes “operons”, in eukaryotes (mostly) individual genes, and splice variants thereof.
$[r] = n^{2}$
Each new regulon requires at least one new regulator (or regulatory combination), and an additional higher order regulator, depending on the degree of required connectivity with (i.e. coordination with) other genes or suites of genes in the network. The existing regulatory network of the cell also has to be expanded to integrate the activity of the new regulon, if the system is not to become disconnected.
Accelerating regulatory networks
The complexity of any organism (and indeed of any organized system) must ultimately be limited by its regulatory overhead, whose ceiling can only be breached by a fundamental change in the regulatory architecture.
Prokaryotic genomes have maximum size of ~12Mb
Prokaryotic complexity is limited by regulatory overhead
The number of regulators in prokaryotes scales as a square function of genome size (number of regulons).
Larry Croft, Martin Lercher, Michael Gagen and John Mattick, http://au.arxiv.org/abs/q-bio.MN/0311021
$R = 0.0000163\,N^{1.96}$ (95% confidence limit on the exponent: 1.81 - 2.11)
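To see what a near-quadratic fit implies, the sketch below evaluates the quoted relation over a range of gene numbers. It assumes N is a gene (regulon) count; the function names are hypothetical, and the "ceiling" is simply an extrapolation of the fitted curve to the point where regulators would equal genes, not a value reported in the source.

```python
# Illustrative evaluation of the fitted scaling R = 0.0000163 * N**1.96.
# Assumption: N is the number of genes (regulons); names are hypothetical.

COEFFICIENT = 0.0000163
EXPONENT = 1.96

def regulators(n_genes):
    """Predicted number of regulators for a genome with n_genes genes."""
    return COEFFICIENT * n_genes ** EXPONENT

def regulator_fraction(n_genes):
    """Predicted share of the gene set devoted to regulation."""
    return regulators(n_genes) / n_genes

for n in (500, 1000, 2000, 4000, 8000):
    print(f"{n:>5} genes: ~{regulators(n):6.0f} regulators "
          f"({regulator_fraction(n):.1%} of genes)")

# Extrapolated gene number at which the curve predicts R == N,
# i.e. every additional gene would have to be a regulator.
ceiling = (1.0 / COEFFICIENT) ** (1.0 / (EXPONENT - 1.0))
print(f"extrapolated ceiling: ~{ceiling:,.0f} genes")
```

Because the exponent is close to 2, doubling the gene number roughly quadruples the predicted number of regulators, so the regulatory share of the genome roughly doubles.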
Additional regulatory strategies in eukaryotes:
Combinatorics?
Chromatin modification
Alternative splicing
RNA?
Advantages: ~1.5 log reduction in coding requirement + sequence specificity
A simplified biological history of the Earth
[Figure: complexity versus time (mya; -4,000, -3,000, -2,000, -1,000, present). Unicellular world: eubacteria, archaea, single-celled eukaryotes (protista). Multicellular world: plants, fungi, animals.]
A large fraction of the mammalian genome is under evolutionary selection
[Figure panels: actual landscape; simulated landscape]
Control of alternative splicing
Functions of trans-acting RNA signals
Boundary sequences across intron exon junctions of alternatively spliced exons are conserved (e.g. WT1)
- massive changes in splicing patterns during differentiation
- mechanism of control of alternative splicing unknown
- cis-acting regulatory sequences frequently mapped close to intron-exon junction
- antisense oligoribonucleotides directed against splice junctions can alter splicing patterns in cultured cells and transgenic mice
WT1 gene: nucleotide sequences around KTS alternative splice site

          EXON 8/8a                                         EXON 9
protein   R S D H L K T H T R T H T G K T S E K P F S C R W P S C Q K K
human     CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCCAAGTTGTCAGAAAAAGTTTGC
mouse     CGGTCCGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCACAGTTGTCAGAAAAAGTTTGC
rat       CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCACAGTTGTCAGAAAAAGTT
pig       CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCCCAGTTGTCAGAAAAAGTTTGC
dunnart   CGGTCTGACCACCTGAAGACACACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGCCGGTGGCCCAGTTGTCAGAAAAAATTTGC
chicken   AGATCTGATCATCTGAAGACTCATACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCTTTCAGTTGCCGATGGCCCAGCTGTCAAAAAAAA
turtle    AGATCTGATCATCTGAAGACTCATACCAGGACTCATACAGGTAAAACAA GTGAAAAACCATTCAGCTGTAGATGGCCCAGCTGTCAAAAAAAATTT
alligator AGGTCTGATCATCTAAAGACTCACACCAGGACTCATACAGGTAAAACAA GTGAAAAACCATTCAGCTGTCGATGGCCCAGCTGTCAAAAAAAATTTGC
newt      CGATCTGACCATTTGAAGACACACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCTTTTAGCTGCCGATGGCCCAGTTGTCAAAAGAAATTTGC
xenopus   AGGTCCGACCACCTGAAGACTCACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCCTTTAGCTGCAGGTGGCCAAGTTGCCAGAAGAAGTTTGC
eel       CGATCTGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCATTCACCTGCAGGTGGCCAAACTGTCAGAAGAAGTTTGC

          EXON 8/8a                                         INTRON 8
human     CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-cattatttttttaaactat
mouse     CGGTCCGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacgtttattttttcattatttt-ctaagctac
rat       CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattctt-cattatttt-ctaagctac
dog       CGGTCTGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-cattatttttttaaaaaaa
cow       CGTTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-ctttatttttttaaactac
zeb 1     CGTTCAGACCACCTTAAGACCCACACCCGGACACATACAGGTAAAACAA gtgcgtaaaccttttcatttttttcatgattccctcctctctttcactc
zeb 2     CGCTCGGACCACCTGAAGACACACACGCGGACTCATACAGGTAAAACAA GTGCgtaagactcctctttcaaacctctcattaaccctttcttccctta
fugu 3    AGGTCTGACCATCTTAAGACACACACCCGGACTCATACAGGTAAAACAA gtgcgtaaacttttattttttccttttcttcttcacccacatgtttacac
fugu 2    CGCTCCGACCACCTTAAGACGCACACTCGGACTCATACAGGTAAACCAA gtgcgtacctttatttcctacttctccattatcagcttgttgttatttt
Conservation of nucleotide sequences around alternative splice sites
Bioinformatic prediction of putative RNA signalling networks in yeast
Genetic evidence suggests that most RNA-mediated regulation and signalling in the eukaryotes is homology-dependent.
If intronic and other noncoding RNAs are functioning as trans-acting signals, at least a subset of their targets should be detectable by homology elsewhere in the genome (operating either at the RNA or DNA level, although the rules for sequence recognition may vary).
We have instituted a bioinformatic search of the yeast genome for intron sequences, focussing on those of length ≥ 16 nucleotides, that have exact matches elsewhere. Yeast was selected as a training set for development of suitable algorithms.
We found over 2400 different intron sequences that have exact reverse complements or complements elsewhere in the genome. This is an order of magnitude greater than predicted.
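The kind of search described above can be sketched in a few lines. This is not the authors' algorithm, which is not given here; it is a minimal illustration that indexes every 16-mer of a genome and reports intron 16-mers whose complement or reverse complement occurs exactly elsewhere. The function names and the toy sequences are made up for the example.

```python
# Toy search for intron k-mers with exact complementary matches elsewhere.
# Hypothetical names and sequences; a sketch, not the published method.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    return seq.translate(_COMPLEMENT)

def reverse_complement(seq):
    return complement(seq)[::-1]

def index_kmers(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = {}
    for i in range(len(genome) - k + 1):
        index.setdefault(genome[i:i + k], []).append(i)
    return index

def find_matches(genome, intron_start, intron_end, k=16):
    """Return (intron offset, match type, genome position) for exact matches
    that lie entirely outside the intron itself."""
    index = index_kmers(genome, k)
    intron = genome[intron_start:intron_end]
    hits = []
    for offset in range(len(intron) - k + 1):
        kmer = intron[offset:offset + k]
        targets = (("complement", complement(kmer)),
                   ("reverse complement", reverse_complement(kmer)))
        for label, target in targets:
            for pos in index.get(target, []):
                if pos + k <= intron_start or pos >= intron_end:
                    hits.append((offset, label, pos))
    return hits

# Toy genome: an "intron" followed by a spacer and a planted target site.
intron = "ACGTTGCAAGGCTTAACCGGTA"
genome = "TTTT" + intron + "GGGGGGGG" + reverse_complement(intron[2:18]) + "CCCC"
hits = find_matches(genome, intron_start=4, intron_end=4 + len(intron))
print(hits)
```

A dictionary index keeps the search linear in genome length for exact matches; non-canonical pairing rules would need a different matching step.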
Yeast meiotic gene subnetwork
Functional Clustering
Triplex DNA
Vasquez & Wilson (1998)
RNA:DNA binding and RNA:RNA binding rules are different from canonical Watson-Crick base pairing
- require new search algorithms
Neural network - nodes have multiple inputs and multiple outputs (dataflow computing).
Small world network - optimal balance between local clustering and long distance connections.
Scale free network - nodes have variable levels of connectivity, resistant to random damage.
Dynamical recurrent network (DRN) - the current activation of the network depends both on the input history of the system (i.e. its existing state) as well as on current inputs. Can be re-set to ground zero (initial state).
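The defining property of a dynamical recurrent network, that the same input can produce different activations depending on the existing state, can be shown with a very small sketch. The class name, the tanh activation and the weights below are assumptions chosen for illustration, not details from the source.

```python
import math

class TinyRecurrentNetwork:
    """Minimal recurrent network: next state depends on current state and inputs.
    Hypothetical illustration; tanh activation is an arbitrary choice."""

    def __init__(self, recurrent_weights, input_weights):
        self.w = recurrent_weights   # w[i][j]: influence of node j on node i
        self.u = input_weights       # u[i][k]: influence of input k on node i
        self.n = len(recurrent_weights)
        self.reset()

    def reset(self):
        """Return the network to its initial state ("ground zero")."""
        self.state = [0.0] * self.n

    def step(self, inputs):
        """Advance one time step and return the new activations."""
        new_state = []
        for i in range(self.n):
            total = sum(self.w[i][j] * self.state[j] for j in range(self.n))
            total += sum(self.u[i][k] * inputs[k] for k in range(len(inputs)))
            new_state.append(math.tanh(total))
        self.state = new_state
        return new_state

net = TinyRecurrentNetwork(
    recurrent_weights=[[0.5, -0.3], [0.8, 0.2]],
    input_weights=[[1.0], [0.5]],
)
first = net.step([1.0])
second = net.step([1.0])      # same input, different existing state
net.reset()
after_reset = net.step([1.0])
print(first, second, after_reset)
```

The second call differs from the first only because the network remembers its previous activation; after `reset()` the first response is reproduced.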
Molecular genetic networks and the architecture of biological complexity
Computational modelling of C. elegans evolution and ontogeny using evolutionary backpropagation and recurrent neural network algorithms
50 node network, random initial weights (50% = 0; 0 < 50% < 1)
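One reading of this initialisation is that half of the connection weights start at zero and the other half are drawn between 0 and 1. The sketch below follows that reading; the uniform distribution, the full 50 x 50 matrix (including self-connections), the exact 50/50 split and the function name are all assumptions.

```python
import random

def initial_weights(n_nodes=50, zero_fraction=0.5, seed=0):
    """Build an n_nodes x n_nodes weight matrix in which zero_fraction of the
    entries are 0 and the rest are drawn uniformly from (0, 1).
    Hypothetical helper; the distribution and layout are assumptions."""
    rng = random.Random(seed)
    n_total = n_nodes * n_nodes
    n_zero = int(n_total * zero_fraction)
    values = [0.0] * n_zero
    while len(values) < n_total:
        w = rng.random()          # random() returns a float in [0, 1)
        if w > 0.0:               # skip the (very unlikely) exact zero
            values.append(w)
    rng.shuffle(values)           # scatter the zero weights across the matrix
    return [values[r * n_nodes:(r + 1) * n_nodes] for r in range(n_nodes)]

weights = initial_weights()
flat = [w for row in weights for w in row]
print(len(weights), len(weights[0]), flat.count(0.0))
```

Fixing the seed makes the starting network reproducible, which is convenient when comparing evolved networks against the same initial condition.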
• Replace with (evolve into) real genome >> accurate simulation of biological evolution, development and variation >> accurate prediction of the phenotypic consequences of genomic variation >> design of new organisms (and other systems capable of self-programming) in silico.
• Increase number of nodes (to 500, 5000, 50000) and require increased functionality (4 dimensional accuracy of cell splits, apoptosis, cell differentiation).
• Redefine nodes in terms of an artificial genome (in an asymmetric cellular milieu) transmitting endogenous signals and receiving both endogenous and exogenous (environmental) signals.
Towards a computational representation of the evolution and development of complex organisms
Bioessays 25: 930-939 (2003)
Acknowledgments
Ryan Taft (UCSD) - ncDNA / genomic DNA ratio
Larry Croft, Michael Gagen, Martin Lercher (Bath) - accelerating regulatory networks
Mike Pheasant - intron conservation and alternative splicing
Stefan Stanley - yeast intron networks
Janet Wiles, Brad Tonkes, Jennifer Hallinan - modelling genetic networks