Genome Annotation BBSI July 14, 2005 Rita Shiang


Page 1: Genome Annotation BBSI July 14, 2005 Rita Shiang

Genome Annotation

BBSI

July 14, 2005

Rita Shiang

Page 2: Genome Annotation BBSI July 14, 2005 Rita Shiang

Genome Annotation

Identification of important components in genomic DNA

Page 3: Genome Annotation BBSI July 14, 2005 Rita Shiang

What is a Gene?

Fundamental unit of heredity

DNA involved in producing a polypeptide; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns)

Entire DNA sequence including exons, introns, and noncoding transcription-control regions

Page 4: Genome Annotation BBSI July 14, 2005 Rita Shiang

What Components are Important in Protein Coding Genes?

Sequences that initiate transcription

Sequences that process hnRNA to mRNA

Signals important in translation

Page 5: Genome Annotation BBSI July 14, 2005 Rita Shiang

TATA Box

Lodish et al, Molecular Cell Biology, 2000, Fig. 10.30.

Page 6: Genome Annotation BBSI July 14, 2005 Rita Shiang

Other Promoters

Initiator consensus
– 5’ Py Py A(+1) N T/A Py Py Py
– N = A, T, G or C
– Py = pyrimidine = C or T

GC-rich sequences
– Stretch of 20-50 GC nucleotides ~100 bp upstream of start site (CpG not common in genome)
– Housekeeping genes
– Multiple initiation sites
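The initiator consensus above can be written as a pattern and scanned against a DNA sequence. The following is a minimal sketch, not a promoter-prediction method: the function name `find_initiators` and the test sequence are made up for illustration, and a consensus this short will match many places in real genomic DNA by chance.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py
# Py = C or T, N = any base. The lookahead lets overlapping matches be reported.
INITIATOR = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")


def find_initiators(dna):
    """Return (offset, motif) for every consensus match in a DNA string.

    The A at +1 (the transcription start) sits at offset + 2.
    """
    return [(m.start(), m.group(1)) for m in INITIATOR.finditer(dna.upper())]


# Hypothetical sequence with one initiator-like element
hits = find_initiators("GGGCCATTCCCGGG")
```

Here `hits` holds one match starting at offset 3, so the +1 A is at offset 5.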

Page 7: Genome Annotation BBSI July 14, 2005 Rita Shiang

Polyadenylation & Cleavage

Addition of a string of As to mRNAs

Polyadenylation signal AAUAAA found before cleavage site

GU or UU rich region ~50 bp from the cleavage site

Stabilizes mRNA transcripts

Lodish et al, Molecular Cell Biology, 2000, Fig. 11.23.
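As a rough illustration of how an annotation program might look for the polyadenylation signal, the sketch below reports every position of AAUAAA in a transcript. The function name `find_polya_signals` and the example sequence are assumptions for illustration; a real tool would also score the downstream GU- or U-rich region and the spacing to the cleavage site.

```python
POLYA_SIGNAL = "AAUAAA"


def find_polya_signals(transcript):
    """Return the 0-based start of every AAUAAA in an RNA (or DNA) string."""
    rna = transcript.upper().replace("T", "U")  # accept DNA input as well
    positions = []
    start = rna.find(POLYA_SIGNAL)
    while start != -1:
        positions.append(start)
        start = rna.find(POLYA_SIGNAL, start + 1)
    return positions


# Hypothetical 3' end with two candidate signals
signals = find_polya_signals("GGAATAAACCAATAAA")
```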

Page 8: Genome Annotation BBSI July 14, 2005 Rita Shiang

Splicing

Lodish et al, Molecular Cell Biology, 2000, Fig. 11.13.

Electron micrograph of adenovirus DNA and hexon gene mRNA

Page 9: Genome Annotation BBSI July 14, 2005 Rita Shiang

Splice Reaction

Lodish et al, Molecular Cell Biology, 2000, Fig. 11.15.

Page 10: Genome Annotation BBSI July 14, 2005 Rita Shiang

Splice Sites

Lodish et al, Molecular Cell Biology, 2000, Fig. 11.14.

Page 11: Genome Annotation BBSI July 14, 2005 Rita Shiang

Additional Splice Sites

Consensus: Py7 N C AG – G (exon) AG – GUAAGU, 98.12%

Nonconsensus
– GC: 0.76%
– U12 introns: AC, PuUAUCCUPy

Other rare sequences: 1%

Py = C or U

Pu = A or G
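One simple way to tally splice-site classes like those above is to look at the first and last two bases of each intron. This is a sketch under assumptions: the class labels and the function name `classify_intron` are illustrative, the AU-AC class is only a proxy for U12-type introns (which are properly recognized by their 5’ and branch-site sequences), and the example introns are invented.

```python
# Terminal dinucleotides (5' end, 3' end) mapped to an illustrative label
INTRON_CLASSES = {
    ("GU", "AG"): "GU-AG",  # consensus
    ("GC", "AG"): "GC-AG",  # nonconsensus 5' site
    ("AU", "AC"): "AU-AC",  # typical of many U12-type introns
}


def classify_intron(intron):
    """Classify an intron sequence by its terminal dinucleotides."""
    rna = intron.upper().replace("T", "U")  # accept DNA input as well
    if len(rna) < 4:
        raise ValueError("intron too short to have two splice sites")
    return INTRON_CLASSES.get((rna[:2], rna[-2:]), "other")


# Hypothetical intron with consensus ends
label = classify_intron("GTAAGTCCCCCCCAG")
```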

Page 12: Genome Annotation BBSI July 14, 2005 Rita Shiang

Translation Signals

5’ cap structure directs ribosomal binding

AUG codes for methionine. The first AUG in a transcript is where translation starts

Open reading frame (ORF)
– Stretch of sequence that codes for amino acids before a stop codon

Translation stop codons UAG, UAA, UGA
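The rules above (start at the first AUG, read codons until a stop) can be sketched as a small function. This is an illustrative simplification: the name `first_orf` and the example transcript are assumptions, and real gene finders consider all reading frames, both strands, and sequence context around the AUG.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}


def first_orf(transcript):
    """Return the ORF from the first AUG through the first in-frame stop codon.

    Returns None if there is no AUG or no in-frame stop codon after it.
    """
    rna = transcript.upper().replace("T", "U")  # accept DNA input as well
    start = rna.find("AUG")
    if start == -1:
        return None
    # Step through complete codons in the frame set by the first AUG
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None


# Hypothetical transcript: 2-base leader, AUG AAA CCC UAG, 2-base trailer
orf = first_orf("GGAUGAAACCCUAGGG")
```

For this example `orf` is `"AUGAAACCCUAG"`.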

Page 13: Genome Annotation BBSI July 14, 2005 Rita Shiang

Capping of the 5’ end of RNA with 7-methylguanylate (m7G)

Lodish et al, Molecular Cell Biology, 2000, Fig. 11.8.

Page 14: Genome Annotation BBSI July 14, 2005 Rita Shiang

Known Gene Components

Lodish et al, Molecular Cell Biology, 2000, Fig. 10.34.

Page 15: Genome Annotation BBSI July 14, 2005 Rita Shiang

Genome Annotation

What is in a genome besides protein coding genes?

Page 16: Genome Annotation BBSI July 14, 2005 Rita Shiang

Repetitive DNA makes up at least 50% of the genome

Transposon-derived interspersed repeats

Inactive retroposed copies of genes (pseudogenes)

Simple short repeats

Segmental duplications

Blocks of tandemly repeated sequences
– Centromeres
– Telomeres
– Short arm of acrocentric chromosomes
– Ribosomal gene clusters

Page 17: Genome Annotation BBSI July 14, 2005 Rita Shiang

Non-protein coding genes or non-coding RNA (ncRNA)

tRNA genes

rRNA genes

snRNA genes
– Splicing
– Telomere maintenance

snoRNA genes

Other
– microRNA

Page 18: Genome Annotation BBSI July 14, 2005 Rita Shiang

Annotation of Genomic DNA

Identifying protein coding genes

Placing the genes on the genome (where are they?)

Page 19: Genome Annotation BBSI July 14, 2005 Rita Shiang

How Many Genes in the Genome?

Early on based on reassociation kinetics the estimate was ~40,000

Walter Gilbert estimated ~100,000 based on gene and genome size

70,000 – 80,000 based on an extrapolated number of CpG islands

With the human genome sequence, the estimate is 30,000 – 40,000

Page 20: Genome Annotation BBSI July 14, 2005 Rita Shiang

Annotation of Genomic DNA Specifically for Genes that Code for Proteins

Match genomic DNA to genes that have been previously cloned and sequenced looking for sequence similarity using BLAST programs

Predict genes using computer programs to scan genomic DNA using known elements

Many strategies use a combination of both methods

Page 21: Genome Annotation BBSI July 14, 2005 Rita Shiang

Lodish et al, Molecular Cell Biology, 2000, Fig. 7.14

cDNA Library Construction

Page 22: Genome Annotation BBSI July 14, 2005 Rita Shiang

Lodish et al, Molecular Cell Biology, 2000, Fig. 7.15

Page 23: Genome Annotation BBSI July 14, 2005 Rita Shiang

Gene Annotation - Celera

Constructed gene models using sequence from cDNAs

Used UniGene database
– Partitions GenBank sequences (mRNAs & ESTs) into non-redundant set using 3’ UTRs
– 111,064 UniGene clusters for human

Page 24: Genome Annotation BBSI July 14, 2005 Rita Shiang

Gene Annotation - Celera cont.

Predicts gene boundaries by identifying overlapping sets of EST and protein matches

Known full-length genes were annotated on the map (matched w/50% of the length & >92% identity)

Clusters that did not match a full-length gene were evaluated using other references

– Conservation of genomic sequence between mouse & human
– Similarity between human & rodent transcripts
– Similarity to known proteins
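The full-length match rule quoted above (50% of the length and >92% identity) amounts to a two-threshold filter on each alignment. The sketch below is an assumption-laden illustration, not Celera's code: the function name and parameters are hypothetical, "50% of the length" is read as "at least 50% of the known gene's length", and the example numbers are invented.

```python
def passes_match_thresholds(aligned_bases, gene_length, percent_identity,
                            min_coverage=0.50, min_identity=92.0):
    """Return True if an alignment meets both thresholds from the slide.

    Assumes coverage is measured against the known gene's length and that
    the 50% cutoff is inclusive while the 92% identity cutoff is strict.
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    coverage = aligned_bases / gene_length
    return coverage >= min_coverage and percent_identity > min_identity


# Hypothetical alignment: 600 of 1,000 bases aligned at 95% identity
accepted = passes_match_thresholds(600, 1000, 95.0)
```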

Page 25: Genome Annotation BBSI July 14, 2005 Rita Shiang

Validation

Validated by construction of known genes (RefSeq)

6.1% of RefSeq genes were not annotated by Otto, Celera's automated annotation system

Page 26: Genome Annotation BBSI July 14, 2005 Rita Shiang

Gene Annotation - Human Genome Sequencing Consortium

Start with Ensembl predicted genes
– ab initio predictions using Genscan
    Based on probabilistic model of genome sequence composition and gene structure
– Confirm similarity to mRNAs, ESTs, protein motifs from all organisms
– Extend protein matches using GeneWise
    Compares protein-based information to genomic sequence and allows for frameshifts and large introns
– Produces partial gene predictions

Page 27: Genome Annotation BBSI July 14, 2005 Rita Shiang

Consortium cont.

Merge Ensembl gene predictions with Genie predictions

– Genie identifies matches of mRNAs and ESTs

Employs hidden Markov models (HMMs) to extend matches using ab initio statistical methods

Links information from 5’ and 3’ ESTs from the same cDNA clone to complete a sequence from the ATG to the stop codon

Can generate alternatively spliced products (though only longest used in this build)

Merge results with genes in RefSeq, SWISSPROT and TrEMBL databases

Page 28: Genome Annotation BBSI July 14, 2005 Rita Shiang

Validation

Validate method by comparing to a new set of known genes, a set of mouse cDNAs and genes on Chromosome 22 (Finished Sequence)

85% sensitivity

13% spurious predictions
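Figures like these come from comparing predictions against a reference set of known genes. The sketch below uses the common definitions, which are an assumption about how the slide's numbers were computed: sensitivity as the fraction of known genes that were found, and the spurious fraction as the share of predictions with no supporting known gene. The counts in the example are hypothetical and chosen only to reproduce the percentages.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of known genes that the method predicted."""
    return true_positives / (true_positives + false_negatives)


def spurious_fraction(true_positives, false_positives):
    """Fraction of predictions that do not correspond to a known gene."""
    return false_positives / (true_positives + false_positives)


# Hypothetical counts: 85 of 100 known genes found; 13 of 100 predictions spurious
found_rate = sensitivity(true_positives=85, false_negatives=15)
spurious_rate = spurious_fraction(true_positives=87, false_positives=13)
```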

Page 29: Genome Annotation BBSI July 14, 2005 Rita Shiang

Factors Affecting Gene Annotation

Splice sites do not conform to consensus

Noncoding exons are common
– Exon: what is left after introns are removed by splicing; the term does not refer to a stretch of coding information
– tRNAs are spliced but noncoding
– >35% of human genes have noncoding exons
– No statistical bias so they are difficult to identify

Page 30: Genome Annotation BBSI July 14, 2005 Rita Shiang

Factors Affecting Gene Annotation Cont.

Internal exons can be very small
– Avg. size of internal exons is ~130 bp
– ~65% of vertebrate exons are 68-208 bp
– >10% are <60 bp
– Exons < 10 bp have been identified
– Invected gene in Drosophila
    One of four exons is 6 bp (GTCGAA)
    Flanked by introns of 27.6 and 1.1 kb
    Not correctly recognized by cDNA alignment software and creates a frameshift in the gene
– Exons of size 0
    Resizing exons create an intermediate splice product

Page 31: Genome Annotation BBSI July 14, 2005 Rita Shiang

Places to View Annotated Genomes

National Center for Biotechnology Information (NCBI)

Ensembl

The Golden Path (UCSC Genome Browser)

Celera

Page 32: Genome Annotation BBSI July 14, 2005 Rita Shiang

Verification of Annotation in C. elegans by Experimentation

Complete genomic sequence

Small introns

Small intergenic regions

Page 33: Genome Annotation BBSI July 14, 2005 Rita Shiang
Page 34: Genome Annotation BBSI July 14, 2005 Rita Shiang

Results

11,984 cDNAs successfully cloned out of a prediction of 19,477

4,365 were not represented by cDNAs or ESTs

Failure of cloning could be due to:
– Wrongly predicted exons
– Very low expressing genes
– Not a real gene
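The cloning numbers above translate into a success rate with simple arithmetic. The sketch below uses only the two counts stated on the slide; the function name `cloning_summary` is illustrative.

```python
def cloning_summary(predicted, cloned):
    """Return (number not cloned, fraction of predictions cloned)."""
    return predicted - cloned, cloned / predicted


# Counts from the slide: 11,984 cDNAs cloned out of 19,477 predictions
not_cloned, success_rate = cloning_summary(predicted=19477, cloned=11984)
```

That leaves 7,493 predictions without a cloned cDNA, a success rate of about 61.5%.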

Page 35: Genome Annotation BBSI July 14, 2005 Rita Shiang

Verification of intron/exon structures

Page 36: Genome Annotation BBSI July 14, 2005 Rita Shiang

Comparison of a Single Transcript

Page 37: Genome Annotation BBSI July 14, 2005 Rita Shiang

Greater than 50% of intron/exon structures need correcting?