Assigned reading:

Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503.

D'Haeseleer, P. 2006. What are DNA sequence motifs? Nat Biotechnol 24: 423-425.

Unit 2.5: Genome Annotation, Gene Prediction, and DNA motifs

Objectives:

-learn what is meant by “genome annotation” and why this is important

-understand why some genomic features are relatively easy to annotate, and others are not

-learn the major ways of representing transcription factor binding sites and the advantages and disadvantages of each

As of:                             6/25/04   7/25/05   7/20/07   6/29/08
Genome projects                       1128      1496      2680      3825
Complete genomes                       199       274       579       827
  (of which eukaryotes)                 28        36        49        94
Prokaryotic genomes in progress        508       728      1285      1932
Eukaryotic genomes in progress         421       494       721       936

Genome sizes, from small to large:

Nanoarchaeum equitans (archaebacterium)     500 kb
Bacillus anthracis (anthrax)                5,228 kb
S. cerevisiae (yeast)                       12,069 kb
Arabidopsis thaliana                        115,428 kb
Drosophila melanogaster (fruit fly)         137,000 kb
Anopheles gambiae (malaria mosquito)        278,000 kb
Oryza sativa (rice)                         420,000 kb
Mus musculus (mouse)                        2,493,000 kb
Homo sapiens (human)                        2,900,000 kb

http://www.genomesonline.org/

Genome Sequencing

so what?

Genome sequencing helps in:

• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

These in turn lead to advances in:

• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions

Because of the vast amounts of data that are generated, we need new approaches:

• high-throughput assays
• robotics
• high-speed computing
• statistics
• bioinformatics

We know the sequence—but can we understand it?

Understanding the genome

Anna Pavlovna's drawing room was gradually filling. The highest Petersburg society was assembled there: people differing widely in age and character but alike in the social circle to which they belonged. Prince Vasili's daughter, the beautiful Helene, came to take her father to the ambassador's entertainment; she wore a ball dress and her badge as maid of honor. The youthful little Princess Bolkonskaya, known as la femme la plus seduisante de Petersbourg, was also there. She had been married during the previous winter, and being pregnant did not go to any large gatherings, but only to small receptions. Prince Vasili's son, Hippolyte, had come with Mortemart, whom he introduced. The Abbe Morio and many others had also come.

To each new arrival Anna Pavlovna said, "You have not yet seen my aunt," or "You do not know my aunt?" and very gravely conducted him or her to a little old lady, wearing large bows of ribbon in her cap, who had come sailing in from another room as soon as the guests began to arrive; and slowly turning her eyes from the visitor to her aunt, Anna Pavlovna mentioned each one's name and then left them.

--Tolstoy, War and Peace

We don’t know the language:


Гостиная Анны Павловны начала понемногу наполняться. Приехала высшая знать Петербурга, люди самые разнородные по возрастам и характерам, но одинаковые по обществу, в каком все жили; приехала дочь князя Василия, красавица Элен, заехавшая за отцом, чтобы с ним вместе ехать на праздник посланника. Она была в шифре и бальном платье. Приехала и известная, как la femme la plus séduisante de Pétersbourg 1, молодая, маленькая княгиня Болконская, прошлую зиму вышедшая замуж и теперь не выезжавшая в большой свет по причине своей беременности, но ездившая еще на небольшие вечера. Приехал князь Ипполит, сын князя Василия, с Мортемаром, которого он представил; приехал и аббат Морио и многие другие.

— Вы не видали еще, — или: — вы не знакомы с ma tante? 2 — говорила Анна Павловна приезжавшим гостям и весьма серьезно подводила их к маленькой старушке в высоких бантах, выплывшей из другой комнаты, как скоро стали приезжать гости, называла их по имени, медленно переводя глаза с гостя на ma tante, и потом отходила.

Все гости совершали обряд приветствования никому не известной, никому не интересной и не нужной тетушки. Анна Павловна с грустным, торжественным участием следила за их приветствиями, молчаливо одобряя их. Ma tante каждому говорила в одних и тех же выражениях о его здоровье, о своем здоровье и о здоровье ее величества, которое нынче было, слава Богу, лучше. Все подходившие, из приличия не выказывая поспешности, с чувством облегчения исполненной тяжелой обязанности отходили от старушки, чтоб уж весь вечер ни


Even if we did, we don’t know the grammar, or punctuation:


annapavlovnasdrawingroomwasgraduallyfillingthehighestpetersburgsocietywasassembledtherepeopledifferingwidelyinageandcharacterbutalikeinthesocialcircletowhichtheybelongedprincevasilisdaughterthebeautifulhelenecametotakeherfathertotheambassadorsentertainmentsheworeaballdressandherbadgeasmaidofhonortheyouthfullittleprincessbolkonskayaknownaslafemmelaplusseduisantedepetersbourgwasalsothereshehadbeenmarriedduringthepreviouswinterandbeingpregnantdidnotgotoanylargegatheringsbutonlytosmallreceptionsprincevasilissonhippolytehadcomewithmortemartwhomheintroducedtheabbemorioandmanyothershadalsocometoeachnewarrivalannapavlovnasaidyouhavenotyetseenmyauntoryoudonotknowmyauntandverygravelyconductedhimorhertoalittleoldladywearinglargebowsofribboninhercapwhohadcomesailinginfromanotherroomassoonastheguestsbegantoarriveandslowlyturninghereyesfromthevisitortoherauntannapavlovnamentionedeachonesnameandthenleftthemeachvisitorperformedtheceremonyofgreetingthisoldauntwhomnotoneofthemknewnotoneofthemwantedtoknowandnotoneofthemcaredaboutannapavlovnaobservedthesegreetingswithmournfulandsolemninterestandsilentapprovaltheauntspoketoeachoftheminthesamewordsabouttheirhealthandherownandthehealthofhermajestywhothankgodwasbettertodayandeachvisitorthoughpolitenesspreventedhisshowingimpatiencelefttheoldwomanwithasenseofreliefathavingperformedavexatiousdutyanddidnotreturntoherthewholeeveningtheyoungprincessbolkonskayahadbroughtsomeworkinagold-


In order to make use of the genome sequence, we need to understand all of its components. Assigning identities and functions to sequences within the genome is called genome annotation.

“With the complete human genome sequence now in hand, we face the enormous challenge of interpreting it and learning how to use that information to understand the biology of human health and disease. The ENCyclopedia Of DNA Elements (ENCODE) Project is predicated on the belief that a comprehensive catalog of the structural and functional components encoded in the human genome sequence will be critical for understanding human biology well enough to address those fundamental aims of biomedical research. Such a complete catalog, or "parts list," would include protein-coding genes, non–protein-coding genes, transcriptional regulatory elements, and sequences that mediate chromosome structure and dynamics; undoubtedly, additional, yet-to-be-defined types of functional sequences will also need to be included.”

Genes (i.e., protein coding)

But. . . only <2% of the human genome encodes proteins

Other than protein coding genes, what is there?

• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk”?

It’s still uncertain/controversial how much of the genome is composed of any of these classes

The answers will come from experimentation and bioinformatics.

What’s in a genome?

Current human genome annotations can be viewed using the UCSC genome browser, as we saw in Unit 2-4.

The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence.

•pilot phase focused on 30 Mb (~ 1%) of the genome

•international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function

•now in its second phase, extending study to entire human genome

Published by AAAS

The ENCODE Project Consortium, Science 306, 636-640 (2004)

Functional genomic elements being identified by the ENCODE pilot phase

protein-coding genes, non–protein-coding genes

•easier to find than other functional elements

•why?

•genes are transcribed—which means that we can identify them by looking at RNA

•traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.

protein-coding genes, non–protein-coding genes

•we can also find genes ab initio using computational methods

•this is most suited to protein-coding genes

•why?

•protein-coding genes have recognizable features

•open reading frames (ORFs)

•codon bias

•known transcriptional and translational start and stop motifs (promoters, start and stop codons, 3’ poly-A sites)

•splice consensus sequences at intron-exon boundaries
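The first feature in the list above, the open reading frame, is simple enough to sketch in code. The following is a minimal illustration only, not how any particular gene finder works: it scans the three forward-strand frames for an ATG followed in-frame by a stop codon. The function name `find_orfs` and the `min_codons` cutoff are hypothetical, and real tools also scan the reverse strand and weigh codon bias, splice signals, and the other features listed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons is an illustrative length cutoff (counting start and stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

In a compact bacterial genome a scan like this already recovers many genes; in intron-rich eukaryotic genomes an ORF is interrupted by introns, which is one reason the splice signals in the list matter.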

ab initio gene discovery

•Protein-coding genes have recognizable features

•We can design software to scan the genome and identify these features

•Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes

•It’s a lot harder for the higher eukaryotes where there are a lot of long introns, genes can be found within introns of other genes, etc.

•We tend to do OK finding protein coding regions, but miss a lot of non-coding 5’ exons and the like

ab initio gene discovery—validating predictions and refining gene models

•Standard types of evidence for validation of predictions include:

•match to previously annotated cDNA

•match to EST from same organism

•similarity of nucleotide or conceptually translated protein sequence to sequences in GenBank

(translation works better—why?)

•predicted protein matches a Pfam domain

•associated with recognized promoter sequences, e.g., TATA box, CpG island

•known phenotype from mutation of the locus

Finding non–protein-coding genes

•e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs

•Harder to find than protein-coding genes

•Why?

•often not poly-A tailed—don’t end up in cDNA libraries

•no ORF

•constraint on sequence divergence at nucleotide not protein level, so homology is harder to detect

•So, how do we find these?

Finding non–protein-coding genes

•secondary structure

•homology, especially alignment of related species

•experimentally

•isolation through non-polyA dependent cloning methods

•microarrays

ab initio gene discovery—approaches

Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to “learn” how to find a pattern.

Two common machine learning approaches used in gene discovery (and many other bioinformatics applications) are artificial neural networks (ANNs) and hidden Markov models (HMMs).

ab initio gene discovery—HMMs

An example state diagram for an HMM for gene discovery is this simplified version of one used by GENSCAN:

Figure: HMM state diagram. States shown: begin gene region, 5’ UTR, start translation, initial exon, single exon, donor splice site, intron, acceptor splice site, exon, final exon, stop translation, 3’ UTR, end gene region. Each state emits nucleotides (A, T, G, C).

Each box and arrow has associated transition probabilities, and emission probabilities for emission of nucleotides (dotted arrow). These are learned from examples of known gene models and provide the probability that a stretch of sequence is a gene.

adapted from Gibson and Muse, A Primer of Genome Science
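To make transition and emission probabilities concrete, here is a toy sketch of decoding with an HMM. It is not the gene model in the diagram: it assumes just two hypothetical states ("noncoding" and "coding") with made-up probabilities, and it uses the standard Viterbi algorithm to find the most probable state path for a sequence. In a real gene finder the probabilities would be learned from known gene models and the state set would be much richer.

```python
import math

STATES = ("noncoding", "coding")

# All probabilities below are invented for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},  # AT-rich
    "coding": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},     # GC-rich
}

def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGCGC"))
```

With these toy numbers, an AT-rich stretch followed by a GC-rich stretch is labeled noncoding and then coding, with one switch at the boundary. Working in log space avoids numerical underflow on long sequences.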

What about other genomic features?

Other than protein coding genes, what is there?

• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk”?

We can begin to annotate regulatory sequences such as transcription factor binding sites and cis-regulatory modules.

Control of Gene Expression—Transcription Factors

Transcription factors (TFs) are proteins that bind to the DNA and help to control gene expression. We call the sequences to which they bind transcription factor binding sites (TFBSs), which are a type of cis-regulatory sequence.

Remember from Unit 2-2:

Isalan et al. Biochemistry 37:12026

Transcription factors bind to specific DNA sequences

Usually, binding sites are first determined empirically.

Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways, as a consensus sequence, or as a position weight matrix (PWM).

Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.

Most transcription factors can bind to a range of similar sequences. We call this a binding “motif.”

Wasserman, W. W. and A. Sandelin (2004). Nat Rev Genet 5(4): 276-287.

We can represent these motifs either as a consensus sequence or as a frequency (or weight) matrix.

Control of Gene Expression—Transcription Factors

Binding site (motif) representations

7 characterized binding sites for a certain transcription factor:

TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC

Consensus sequence:

[T/C] C [C/T] G G A T G C

Frequency matrix and its graphical depiction, a sequence logo:

Position   1  2  3  4  5  6  7  8  9
A          1  1  1  0  0  7  2  0  0
T          3  0  2  0  0  0  5  0  2
G          1  1  0  7  7  0  0  6  0
C          2  5  4  0  0  0  0  1  5
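The frequency matrix is just a column-by-column tally of the aligned sites. A minimal sketch, using the seven sites from this slide (the function name `count_matrix` is hypothetical, and the sites are assumed to be pre-aligned and of equal length):

```python
# The seven characterized binding sites from the slide
SITES = [
    "TCCGGAAGC",
    "TCCGGATGC",
    "TCCGGATCT",
    "CATGGATGC",
    "CCAGGAAGT",
    "GGTGGATGC",
    "ACCGGATGC",
]

def count_matrix(sites):
    """Count each base at each position of a set of aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ATGC"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

if __name__ == "__main__":
    for base, row in count_matrix(SITES).items():
        print(base, *row)
```

Each column sums to 7, the number of sites; invariant positions (4 to 6 here) show a single count of 7.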

Binding site (motif) representations

A consensus sequence is a one-line description of the TFBS, based on a column-by-column alignment of the individual known binding sites. The usual rule is:

A single base is shown if it occurs in more than half the sites and at least twice as often as the second most frequent base. Otherwise, a double degenerate symbol (e.g., G/C = S) is used if two bases occur in more than 75% of the sites, or a triple degenerate symbol when one base does not occur at all.
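The rule can be written as a short function applied to one alignment column at a time. This is a sketch of one reading of the rule, not a reference implementation: the thresholds are taken literally ("more than half", "more than 75%"), a column that satisfies none of the three conditions is assumed to be reported as N, and the function name `consensus_symbol` is hypothetical. The degenerate symbols are the standard IUPAC nucleotide codes. The columns in the usage lines are invented examples.

```python
# Standard IUPAC codes for two- and three-base ambiguity
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}

def consensus_symbol(counts):
    """Return the consensus symbol for one column given base counts."""
    full = {b: counts.get(b, 0) for b in "ACGT"}
    total = sum(full.values())
    ranked = sorted(full, key=full.get, reverse=True)
    top, second = full[ranked[0]], full[ranked[1]]
    # Single base: more than half the sites, and at least twice the runner-up
    if top > total / 2 and top >= 2 * second:
        return ranked[0]
    # Double degenerate: two bases together in more than 75% of sites
    if top + second > 0.75 * total:
        return IUPAC[frozenset(ranked[:2])]
    # Triple degenerate: exactly one base never occurs
    present = [b for b in ranked if full[b] > 0]
    if len(present) == 3:
        return IUPAC[frozenset(present)]
    return "N"  # assumption: fall back to "any base"

if __name__ == "__main__":
    print(consensus_symbol({"A": 0, "C": 0, "G": 7, "T": 0}))  # invariant column
    print(consensus_symbol({"A": 1, "C": 3, "G": 3, "T": 0}))  # two dominant bases
```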

A frequency matrix shows the actual frequencies of each base in each column. This can be easily converted to a position weight matrix (PWM), which is a normalized version of the frequency matrix that is therefore not dependent on the number of sites in the alignment.

Consensus sequences make searching easy—it’s a simple text search that can even be done using a word processor, or very simply programmed in a computer language such as Perl:

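while (<SEQUENCE>) {
    # [TC] matches T or C; inside brackets a "|" would match a literal pipe
    if (/[TC]C[TC]GGATGC/) {
        print;    # placeholder: do something with the matching line
    }
}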

All positions in the motif are treated the same.

Finding binding sites in the genome

Consensus: [T/C] C [C/T] G G A T G C

Identifying transcription factor binding sites

But PWMs are generally more useful:

•they allow us to assign more importance to more invariant positions

•they are related to the binding energy of the DNA-protein interaction

•we can compare PWMs and we can score PWMs

Scores are based on the probability of a given nucleotide being in a given position.


Position   1  2  3  4  5  6  7  8  9
A          1  1  1  0  0  7  2  0  0
T          3  0  2  0  0  0  5  0  2
G          1  1  0  7  7  0  0  6  0
C          2  5  4  0  0  0  0  1  5

Consensus: [T/C] C [C/T] G G A T G C

Example 1: TCCGGAAGC scores higher than TCCGGAACT, which in turn scores higher than TCCGGAAAA, because GC > CT > AA in the last two positions (counts of 6 and 5, versus 1 and 2, versus 0 and 0). Note that the latter two sequences would score the same if using only the consensus representation.
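Example 1 can be checked with a small scoring sketch. The details here are assumptions rather than anything prescribed in this unit: counts are converted to log-odds weights against a uniform background of 0.25 per base, and a pseudocount of 0.5 is added to every cell so that bases never observed at a position get a large negative weight instead of log(0). The names `log_odds_pwm` and `score` are hypothetical.

```python
import math

# Frequency matrix from the slide (rows: base, columns: positions 1-9)
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}

def log_odds_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds versus a uniform background.

    The pseudocount and background values are illustrative assumptions.
    """
    n_sites = sum(row[0] for row in counts.values())
    denom = n_sites + pseudocount * len(counts)
    return {
        base: [math.log2(((c + pseudocount) / denom) / background) for c in row]
        for base, row in counts.items()
    }

def score(pwm, seq):
    """Sum the position-specific weights for each base of seq."""
    return sum(pwm[base][pos] for pos, base in enumerate(seq))

if __name__ == "__main__":
    pwm = log_odds_pwm(COUNTS)
    for seq in ("TCCGGAAGC", "TCCGGAACT", "TCCGGAAAA"):
        print(seq, round(score(pwm, seq), 2))
```

Because the weights are normalized by the number of sites, the scores do not depend on how many sites went into the alignment, and a mismatch at an invariant position costs more relative to the best-scoring base than one at a weakly constrained position.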



Example 2: TCGGGATGC and TCCAGATGC each have a single mismatch compared to the consensus (at position 3 and position 4, respectively). But the first is the better binding site when scored using a PWM, because G is strongly conserved in position 4 (7 of 7 sites) whereas the requirement for C or T in position 3 is weak.

Issues with finding binding sites in the genome

But it’s important to use caution: just because a sequence in the genome is a reasonable match to a known TFBS, this doesn’t necessarily mean that the TF is binding there in vivo. By crude calculation:

The probability of finding a given 7 bp motif at any one position is 4^-7 = 1/16,384, i.e., we expect about 1 chance occurrence every 16 kb.

So in the human genome (taking ~3 Gb), this sequence should be present over 183,000 times! (>7x per gene!) Even in a 10 Mb genome, the sequence would occur over 600 times.

And this calculation does not even take into account motif degeneracy! So we need to consider additional factors in deciding which predicted binding sites are important, such as how regulatory regions are organized.
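The crude calculation above is easy to reproduce. The sketch below assumes what the calculation assumes: all four bases equally likely and independent, one strand only, and a genome length of 3 Gb for human. The function names are hypothetical. The second function shows the effect of degeneracy by letting each position allow more than one base, using the 9 bp consensus [T/C]C[C/T]GGATGC from this unit as an example.

```python
def expected_hits(motif_length, genome_bp):
    """Expected chance matches to one exact motif on one strand."""
    return genome_bp * 0.25 ** motif_length

def expected_hits_degenerate(allowed_per_position, genome_bp):
    """Expected chance matches when position i allows allowed_per_position[i] bases."""
    probability = 1.0
    for n_allowed in allowed_per_position:
        probability *= n_allowed / 4
    return genome_bp * probability

if __name__ == "__main__":
    print(expected_hits(7, 3_000_000_000))  # human, assuming 3 Gb
    print(expected_hits(7, 10_000_000))     # a 10 Mb genome
    # [T/C] C [C/T] G G A T G C: two positions allow 2 bases, seven allow 1
    print(expected_hits_degenerate([2, 1, 2, 1, 1, 1, 1, 1, 1], 3_000_000_000))
```

Each two-fold degenerate position doubles the expected number of chance matches, so the 9 bp degenerate consensus is expected about as often as an exact 8 bp motif.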

Empirical methods, such as ChIP-chip (see Unit 2-3) are a good alternative for looking at in vivo binding; bioinformatics methods can be combined with this to determine the transcription factor binding motifs.

Because of the difficulty in accurately predicting bona fide, functional TFBSs, most current genome annotation focuses on empirically determined sites. Several databases curate these data, e.g. the Open Regulatory Annotation database (ORegAnno) and the Regulatory Element Database for Drosophila (REDfly). Tracks displaying these data can be found in the UCSC Genome Browser. These databases also curate cis-regulatory module sequences, which at present can only reliably be determined by empirical methods.

Genome Annotation—Transcription Factors

http://www.oreganno.org/oregano/Index.jsp

http://redfly.ccr.buffalo.edu/

http://genome.ucsc.edu/

Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated. For the higher eukaryotes, only a tiny percentage of features such as TFBSs, CRMs, and other non-gene features have so far been identified.

Genome Annotation—much work remains
