Genomic programming of complex organisms: the hidden layer of noncoding RNA

Numbers of protein-coding genes do not scale strongly with complexity:

- fly and worm protein-coding gene numbers (14,000-19,000) are only about 2-3 times those of yeast (6,000) and P. aeruginosa (5,500)

- mammalian (human, mouse) protein-coding gene numbers (~30,000) are only about twice those of invertebrates (and less than some plants).

The genetic basis of eukaryotic complexity and variation

The relative amount of noncoding DNA does scale with complexity

[Figure: groups compared, in order: vertebrates; Ciona (chordate); invertebrates; plants; complex fungi; simpler eukaryotes; prokaryotes]

Either the human genome is replete with useless transcription or these non-protein-coding RNAs are fulfilling some unexpected function

~ 98% of the transcriptional output in humans is noncoding RNA

- around 95-97% of the primary transcript of protein coding genes is intronic

- there are enormous numbers of noncoding RNA genes in the mammalian genome, which are only now beginning to be recognised, and which appear to account for between 1/2 and 3/4 of all transcripts

At least 50% and possibly the majority of the human genome is transcribed

- 1.5% of the genome is protein coding; since only about 5% of a primary transcript is exonic, a minimum of 30% of the genome is transcribed (1.5% x 20)

- if there is an equal number of noncoding RNA transcripts, then over 60% is transcribed
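The arithmetic behind these two estimates can be sketched as follows. This is only an illustration: it assumes that exons make up about 5% of a primary transcript (the upper end of the 95-97% intronic figure above, which gives the x20 multiplier) and that noncoding transcripts cover as much sequence as protein-coding ones without overlapping them. The function and parameter names are hypothetical.

```python
def transcribed_fraction(coding_fraction, exon_share, noncoding_per_coding=0.0):
    """Estimate the fraction of the genome that is transcribed.

    coding_fraction: fraction of the genome that is protein coding (0.015 here).
    exon_share: assumed exonic share of a primary transcript (0.05 gives x20).
    noncoding_per_coding: assumed amount of noncoding RNA transcription relative
        to protein-coding gene transcription (1.0 means an equal amount).
    """
    from_coding_genes = coding_fraction / exon_share
    total = from_coding_genes * (1.0 + noncoding_per_coding)
    return min(total, 1.0)  # a fraction of the genome cannot exceed 100%


minimum = transcribed_fraction(0.015, 0.05)
with_ncrna = transcribed_fraction(0.015, 0.05, noncoding_per_coding=1.0)
print(f"protein-coding genes only: {minimum:.0%}")
print(f"with equal ncRNA output:   {with_ncrna:.0%}")
```

Lowering the assumed exon share towards 3% raises both estimates, which is why the slide treats 30% as a minimum.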

The genetic basis of human complexity and variation

The central dogma states that genetic information flows from DNA to RNA to protein

This is usually interpreted to mean that genetic information flows from DNA to protein (via mRNA), i.e. genes are generally synonymous with proteins

This is true in prokaryotes, whose genomes consist largely (75-95%) of wall-to-wall protein-coding sequences flanked by 5’ and 3’ regulatory elements.

It has been assumed that the same applies in eukaryotes, with the logical extension (i) that the increased complexity of eukaryotes is explained by the combinatorics of regulatory factors intersecting with more complex promoters etc., and the corollary (ii) that non-protein-coding sequences in eukaryotic genomes (98.5% in humans) are either cis-regulatory elements or evolutionary junk.

Thus proteins comprise not only the functional and structural components of cells, but are also the agents by which the system is regulated in conjunction with cis-elements and environmental signals.

These assumptions are now articles of faith, but they are not necessarily correct.

The assumption that genes are synonymous with proteins (and that proteins are all that is required to program a complex organism) has led to all sorts of subsidiary assumptions, notably that introns (because they do not encode protein, and despite the fact that they are transcribed) do not transmit genetic information into the system as RNA molecules.

If one considers the reasonable alternative, i.e. if intronic RNA is functional, then there are a number of logical extensions:

1. Genetic information is being expressed both as RNA and as proteins, and intronic RNA is transmitting secondary information in parallel with protein-coding sequences, which must be involved in networking of gene activity. Thus the genetic operating system is different between eukaryotes and prokaryotes.

2. Some, perhaps many genes, will have evolved only to express RNA signals.

3. These RNAs must be processed post-splicing and be transmitting information via RNA-DNA, RNA-RNA and RNA-protein interactions, presumably sequence-specifically. This equates to a quasi-digital feed-forward regulatory system that would, in theory, permit integration of complex suites of gene activity and regulatory regimes, and crucially the programming of gene expression profiles throughout the trajectories of differentiation and development.

1. The majority of the genomic sequence in the higher organisms (the non-protein-coding DNA) is devoted to the control of developmental programming.

2. The majority of the regulatory transactions during development in the higher organisms are conveyed by RNAs, not proteins, although the two classes of regulatory controls work in concert.

3. The combinatorics of protein regulators intersecting with environmental signals in itself provides insufficient state information for the programming of differentiation and development. Rather the importance of protein signaling is to provide contextual cues to guide and to tune the (RNA-directed) endogenously programmed pathways by providing positional information and correcting stochastic errors.

4. The complexity of prokaryotes has been limited throughout evolution not by biochemical or environmental factors, or cell structure, but by a primitive regulatory system based on proteins alone.

[Figure: two gene-expression schemes compared.

Prokaryotic gene: mRNA; protein (catalytic function, structural role, regulation). SINGLE OUTPUT, SIMPLE OPERATING SYSTEM.

Eukaryotic gene: mRNA and/or eRNA; protein (catalytic function, structural role, regulation); networking functions. MULTIPLEX OUTPUT, PARALLEL PROCESSING.]

Hidden layer

• 98% of all transcription in humans is ncRNA, and over half of the human genome is transcribed

• There are 20,000 pseudogenes in the human genome - many of which are transcribed, and at least one of which has function (as an RNA, regulating the expression of its protein-coding homolog).

• There are huge blocks of conservation in introns and intergenic sequences

• At least 60% of all human genes have associated antisense transcripts - mechanism of imprinting, transvection and inter-allelic communication? Other?

• Introns and ncRNAs are processed to microRNAs

• All well studied gene loci in animals have a majority of noncoding transcripts (callipyge, IGFR, Gnas, globin, bithorax etc.).

• At least 50% of mouse cDNAs do not appear to encode protein, and whole-transcript mapping of human chromosomes shows an order of magnitude more transcripts than expected from protein-coding sequences

• A wide range of complex genetic phenomena in eukaryotes (RNAi, co-suppression, transvection, transinduction) are directed by RNA.

Aparicio et al. Science 297, 1301-1310 (2002)

Not all introns have evolved function, as each intron is the descendant of an independent insertion event and is evolving independently, albeit in the context of its host transcript - as exemplified by the intron distribution in Fugu rubripes.

Multiple microRNAs (miRs) are produced by processing of “intergenic” noncoding RNAs and introns

Lagos-Quintana et al. Science 294, 853-858 (2001)

Lau et al. Science 294, 858-862 (2001)

Lee and Ambros Science 294, 862-864 (2001)

“Glimpses of a tiny RNA world” Ruvkun Science 294, 797-799 (2001)

Imprinted micro RNA genes transcribed antisense to a reciprocally imprinted retrotransposon-like gene. Seitz et al. (2003) Nature Genetics 34, 261-262.

Functions of trans-acting RNA signals

Control of alternative splicing

Epigenetic modification

Transcriptional regulation

Control of mRNA turnover

Control of translation

Signal transduction

RNA modification and editing

[Figure labels: gene (transcription unit / cluster); transcription; primary transcript (exons + introns); splicing; mRNA or ncRNA; protein (catalytic functions, structural roles, signal transduction and regulation of gene expression); processing; snoRNAs, microRNAs; assembly; networking; other functions]

Revised definition of gene and flow of genetic information

Rnomics

Definition of transcriptome and the dynamic expression of ncRNA genes in differentiation and development: mammalian ncRNA database and development of ncRNA chips.

Computational analysis of homology patterns in noncoding sequences within and between genomes: evolutionary constraints, network wiring, mechanistic clues, selection of candidates for molecular genetic analysis.

Genetic and molecular genetic analyses in model organisms (yeast, C. elegans, Drosophila and vertebrates): proof-of-principle experiments to show that introns are transmitting genetic signals, genetic and biochemical resource development to test mechanism.

Modelling of eRNA genetic networks. Proof that regulatory networks are accelerating networks and that prokaryotes have been limited by regulatory overhead.

The numbers of regulators must rise as a non-linear (quadratic) function of the number of regulons (genes or co-regulated modules of genes) - in prokaryotes “operons”, in eukaryotes (mostly) individual genes, and splice variants thereof.

[r] = n^2

Each new regulon requires at least one new regulator (or regulatory combination), and an additional higher order regulator, depending on the degree of required connectivity with (i.e. coordination with) other genes or suites of genes in the network. The existing regulatory network of the cell also has to be expanded to integrate the activity of the new regulon, if the system is not to become disconnected.

Accelerating regulatory networks

The complexity of any organism (and indeed of any organized system) must ultimately be limited by its regulatory overhead, whose ceiling can only be breached by a fundamental change in the regulatory architecture.

Prokaryotic genomes have a maximum size of ~12Mb

Prokaryotic complexity is limited by regulatory overhead

The numbers of regulators in prokaryotes scale as a square function of genome size (number of regulons).

Larry Croft, Martin Lercher, Michael Gagen and John Mattick, http://au.arxiv.org/abs/q-bio.MN/0311021

R = 0.0000163 N^{1.96} (95% confidence limit on the exponent 1.81 - 2.11)
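The consequence of this near-quadratic fit can be explored with a few lines of code. The sketch below simply evaluates the fitted relationship quoted above; it assumes that N is the number of genes and R the number of regulatory genes, and the function names are illustrative only.

```python
# Fitted relationship quoted above: R = 0.0000163 * N**1.96
COEFF = 0.0000163
EXPONENT = 1.96


def regulators(n_genes):
    """Predicted number of regulators for a genome with n_genes genes."""
    return COEFF * n_genes ** EXPONENT


def marginal_regulators(n_genes):
    """Extra regulators needed per additional gene (derivative of the fit)."""
    return COEFF * EXPONENT * n_genes ** (EXPONENT - 1.0)


for n in (500, 2000, 4000, 8000):
    r = regulators(n)
    print(f"N={n:>5}  R={r:8.1f}  R/N={r / n:6.3f}  dR/dN={marginal_regulators(n):6.3f}")
```

Because the exponent is close to 2, doubling the gene number roughly quadruples the predicted number of regulators, so the regulatory share of the genome grows with genome size. This is the sense in which the network is accelerating.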

Additional regulatory strategies in eukaryotes:

Combinatorics?

Chromatin modification

Alternative splicing

RNA?

Advantages: ~1.5 log reduction in coding requirement + sequence specificity

A simplified biological history of the Earth

[Figure: complexity plotted against time (mya), from -4,000 to the present. Unicellular world: eubacteria, archaea, single-celled eukaryotes (protista). Multicellular world: plants, fungi, animals.]

A large fraction of the mammalian genome is under evolutionary selection

[Figure panels: actual landscape; simulated landscape]

Functions of trans-acting RNA signals

Control of alternative splicing

Boundary sequences across intron-exon junctions of alternatively spliced exons are conserved (e.g. WT1)

- massive changes in splicing patterns during differentiation

- mechanism of control of alternative splicing unknown

- cis-acting regulatory sequences frequently mapped close to intron-exon junction

- antisense oligoribonucleotides directed against splice junctions can alter splicing patterns in cultured cells and transgenic mice

WT1 gene – nucleotide sequences around KTS alternative splice site

            EXON 8/8a                                         EXON 9
amino acids R S D H L K T H T R T H T G K T S E K P F S C R W P S C Q K K
human       CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCCAAGTTGTCAGAAAAAGTTTGC
mouse       CGGTCCGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCACAGTTGTCAGAAAAAGTTTGC
rat         CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCACAGTTGTCAGAAAAAGTT
pig         CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGTCGGTGGCCCAGTTGTCAGAAAAAGTTTGC
dunnart     CGGTCTGACCACCTGAAGACACACACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCCTTCAGCTGCCGGTGGCCCAGTTGTCAGAAAAAATTTGC
chicken     AGATCTGATCATCTGAAGACTCATACCAGGACTCATACAGGTAAAACAA GTGAAAAGCCTTTCAGTTGCCGATGGCCCAGCTGTCAAAAAAAA
turtle      AGATCTGATCATCTGAAGACTCATACCAGGACTCATACAGGTAAAACAA GTGAAAAACCATTCAGCTGTAGATGGCCCAGCTGTCAAAAAAAATTT
alligator   AGGTCTGATCATCTAAAGACTCACACCAGGACTCATACAGGTAAAACAA GTGAAAAACCATTCAGCTGTCGATGGCCCAGCTGTCAAAAAAAATTTGC
newt        CGATCTGACCATTTGAAGACACACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCTTTTAGCTGCCGATGGCCCAGTTGTCAAAAGAAATTTGC
xenopus     AGGTCCGACCACCTGAAGACTCACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCCTTTAGCTGCAGGTGGCCAAGTTGCCAGAAGAAGTTTGC
eel         CGATCTGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA GTGAGAAGCCATTCACCTGCAGGTGGCCAAACTGTCAGAAGAAGTTTGC

            EXON 8/8a                                         INTRON 8
human       CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-cattatttttttaaactat
mouse       CGGTCCGACCATCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacgtttattttttcattatttt-ctaagctac
rat         CGGTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattctt-cattatttt-ctaagctac
dog         CGGTCTGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-cattatttttttaaaaaaa
cow         CGTTCCGACCACCTGAAGACCCACACCAGGACTCATACAGGTAAAACAA gtgcgtaaacttttcttcacatttattttt-ctttatttttttaaactac
zeb 1       CGTTCAGACCACCTTAAGACCCACACCCGGACACATACAGGTAAAACAA gtgcgtaaaccttttcatttttttcatgattccctcctctctttcactc
zeb 2       CGCTCGGACCACCTGAAGACACACACGCGGACTCATACAGGTAAAACAA GTGCgtaagactcctctttcaaacctctcattaaccctttcttccctta
fugu 3      AGGTCTGACCATCTTAAGACACACACCCGGACTCATACAGGTAAAACAA gtgcgtaaacttttattttttccttttcttcttcacccacatgtttacac
fugu 2      CGCTCCGACCACCTTAAGACGCACACTCGGACTCATACAGGTAAACCAA gtgcgtacctttatttcctacttctccattatcagcttgttgttatttt

Conservation of nucleotide sequences around alternative splice sites
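Conservation of this kind can be quantified column by column. The sketch below is one simple way to do it, not the method used for the slides: it scores each alignment column by the share of sequences carrying the most common symbol. The input is the first 30 intron 8 positions of the human, mouse and rat WT1 sequences; the function name is hypothetical.

```python
def column_identity(sequences):
    """Per alignment column, the fraction of sequences sharing the most common symbol."""
    length = min(len(s) for s in sequences)
    scores = []
    for i in range(length):
        column = [s[i] for s in sequences]
        most_common = max(set(column), key=column.count)
        scores.append(column.count(most_common) / len(column))
    return scores


# First 30 positions of WT1 intron 8 in three mammals.
intron8 = {
    "human": "gtgcgtaaacttttcttcacatttattttt",
    "mouse": "gtgcgtaaacttttcttcacgtttattttt",
    "rat":   "gtgcgtaaacttttcttcacatttattctt",
}

scores = column_identity(list(intron8.values()))
fully_conserved = sum(1 for s in scores if s == 1.0)
print(f"{fully_conserved} of {len(scores)} columns are identical in all three species")
```

A fuller analysis would compare such scores with the level expected for neutrally evolving sequence, which this sketch does not attempt.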

Bioinformatic prediction of putative RNA signalling networks in yeast

Genetic evidence suggests that most RNA-mediated regulation and signalling in the eukaryotes is homology-dependent.

If intronic and other noncoding RNAs are functioning as trans-acting signals, at least a subset of their targets should be detectable by homology elsewhere in the genome (operating either at the RNA or DNA level, although the rules for sequence recognition may vary).

We have instituted a bioinformatic search of the yeast genome for intron sequences, focussing on those of length ≥ 16 nucleotides, that have exact matches elsewhere. Yeast was selected as a training set for development of suitable algorithms.

We found over 2400 different intron sequences that have exact reverse complements or complements elsewhere in the genome. This is an order of magnitude greater than predicted.
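A search of this kind can be sketched with a k-mer index. The code below is a minimal illustration rather than the algorithm actually used: it assumes plain A/C/G/T strings, looks only for exact reverse complements of intron 16-mers, and runs on toy sequences invented for the example. Plain complements could be found with an analogous lookup.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_antisense_matches(intron, genome, k=16):
    """Yield (intron_pos, genome_pos, kmer) wherever the reverse complement
    of an intron k-mer occurs exactly in the genome."""
    # Index every genomic k-mer once so that each lookup is a dictionary hit.
    index = {}
    for pos in range(len(genome) - k + 1):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    for i in range(len(intron) - k + 1):
        kmer = intron[i:i + k]
        for pos in index.get(reverse_complement(kmer), []):
            yield i, pos, kmer


# Toy data (hypothetical): the genome carries the reverse complement
# of the first 16 nucleotides of the intron.
intron = "GATTACAGATTACAGGCC"
genome = "TTTT" + reverse_complement(intron[:16]) + "CCCC"
matches = list(find_antisense_matches(intron, genome))
print(matches)
```

On a real genome the intron's own locus would need to be excluded, and the number of matches would have to be compared with the number expected by chance.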

Yeast meiotic gene subnetwork

Functional Clustering

Triplex DNA

Vasquez & Wilson (1998)

RNA:DNA binding and RNA:RNA binding rules are different from canonical Watson-Crick base pairing

- require new search algorithms
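One way to make a search algorithm independent of a fixed pairing rule is to pass the pairing table in as data. The sketch below illustrates the idea for RNA:RNA duplexes only, using G:U wobble pairs as an assumed example of a non-canonical rule. It does not implement the triplex rules of the cited work, and all names are hypothetical.

```python
# Pairing tables: each base maps to the set of bases it may pair with.
WATSON_CRICK = {"A": {"U"}, "U": {"A"}, "G": {"C"}, "C": {"G"}}
# Assumption for illustration: additionally permit G:U wobble pairs.
WITH_WOBBLE = {"A": {"U"}, "U": {"A", "G"}, "G": {"C", "U"}, "C": {"G"}}


def can_pair(rna_a, rna_b, rules):
    """True if rna_a pairs with rna_b in antiparallel orientation under `rules`."""
    if len(rna_a) != len(rna_b):
        return False
    return all(y in rules[x] for x, y in zip(rna_a, reversed(rna_b)))


def find_targets(signal, target, rules):
    """Start positions in `target` where `signal` can pair along its full length."""
    k = len(signal)
    return [i for i in range(len(target) - k + 1)
            if can_pair(signal, target[i:i + k], rules)]


signal = "GGAUG"
print(find_targets(signal, "AACAUCCAA", WATSON_CRICK))  # canonical site
print(find_targets(signal, "AACAUCUAA", WATSON_CRICK))  # site needs a G:U pair
print(find_targets(signal, "AACAUCUAA", WITH_WOBBLE))
```

Swapping the table changes which sites are reported without touching the search itself, which is the design point: different recognition rules can be tested against the same genome scan.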

Neural network - nodes have multiple inputs and multiple outputs (dataflow computing).

Small world network - optimal balance between local clustering and long distance connections.

Scale free network - nodes have variable levels of connectivity, resistant to random damage.

Dynamical recurrent network (DRN) - the current activation of the network depends both on the input history of the system (i.e. its existing state) and on current inputs. Can be re-set to ground zero (initial state).
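The defining property of a dynamical recurrent network, that the same input gives a different response depending on the existing state, can be shown in a few lines. This is a minimal sketch under stated assumptions: a logistic activation and a dense weight matrix are chosen for illustration, and the class name is hypothetical.

```python
import math


class RecurrentNetwork:
    """Minimal dynamical recurrent network: next state = f(W * state + input)."""

    def __init__(self, weights, initial_state):
        self.weights = weights                    # weights[i][j]: influence of node j on node i
        self.initial_state = list(initial_state)
        self.state = list(initial_state)

    def step(self, inputs):
        """Advance one time step using the existing state and the current inputs."""
        new_state = []
        for i, row in enumerate(self.weights):
            total = sum(w * s for w, s in zip(row, self.state)) + inputs[i]
            new_state.append(1.0 / (1.0 + math.exp(-total)))  # logistic activation (assumed)
        self.state = new_state
        return new_state

    def reset(self):
        """Return the network to its initial state ("ground zero")."""
        self.state = list(self.initial_state)


net = RecurrentNetwork([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0])
first = net.step([1.0, 0.0])
second = net.step([1.0, 0.0])   # same input, different existing state
net.reset()
after_reset = net.step([1.0, 0.0])
print(first, second, after_reset)
```

The first and second responses differ even though the input is identical, while the response after a reset reproduces the first one.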

Molecular genetic networks and the architecture of biological complexity

Computational modelling of C. elegans evolution and ontogeny using evolutionary backpropagation and recurrent neural network algorithms

50 node network, random initial weights (50% = 0; 0 < 50% < 1)
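The starting condition described here can be reproduced as follows. The sketch reads the specification as: half of the connection weights are exactly zero and the other half lie strictly between 0 and 1. The uniform distribution, the full 50 x 50 connection matrix and the function name are assumptions made for illustration.

```python
import random


def random_weights(n_nodes=50, zero_fraction=0.5, seed=None):
    """n_nodes x n_nodes weight matrix: a fixed share of entries is exactly zero,
    the remainder are drawn uniformly from the open interval (0, 1)."""
    rng = random.Random(seed)
    n_entries = n_nodes * n_nodes
    n_zero = int(n_entries * zero_fraction)
    flat = [0.0] * n_zero
    while len(flat) < n_entries:
        w = rng.random()        # random() returns values in [0, 1)
        if w > 0.0:             # reject an exact 0 so non-zero weights stay in (0, 1)
            flat.append(w)
    rng.shuffle(flat)           # scatter the zero entries across the matrix
    return [flat[i * n_nodes:(i + 1) * n_nodes] for i in range(n_nodes)]


weights = random_weights(seed=1)
n_zero = sum(1 for row in weights for w in row if w == 0.0)
print(f"{len(weights)} nodes, {n_zero} of {len(weights) ** 2} weights are zero")
```

A matrix built this way could serve as the starting point for an evolutionary search over weights, which is the role the random initial network plays on this slide.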

• Replace with (evolve into) real genome >> accurate simulation of biological evolution, development and variation >> accurate prediction of the phenotypic consequences of genomic variation >> design of new organisms (and other systems capable of self-programming) in silico.

• Increase number of nodes (to 500, 5000, 50000) and require increased functionality (4 dimensional accuracy of cell splits, apoptosis, cell differentiation).

• Redefine nodes in terms of an artificial genome (in an asymmetric cellular milieu) transmitting endogenous signals and receiving both endogenous and exogenous (environmental) signals.

Towards a computational representation of the evolution and development of complex organisms

Bioessays 25: 930-939 (2003)

Acknowledgments

Ryan Taft (UCSD) - ncDNA / genomic DNA ratio

Larry Croft, Michael Gagen, Martin Lercher (Bath) - accelerating regulatory networks

Mike Pheasant - intron conservation and alternative splicing

Stefan Stanley - yeast intron networks

Janet Wiles, Brad Tonkes, Jennifer Hallinan - modelling genetic networks

Institute for Molecular Bioscience, University of Queensland

Brisbane

[email protected]
