Ensembl. Going beyond A, T, G and C

Ewan Birney

There is more to life than proteins (but not much)

Ensembl

ENCODE

Reactome

Human/Mouse

Reconcile with Genome

Project orthologous proteins onto genome

Human

Mouse/Other Mammals

Increase in quality

[Figure: stacked bars on a 0% to 100% axis. Series: Human-UniSw-33, -34, -35; Human-RefSeq-33, -34, -35; Mouse-UniSw-30, -32, -33, -34; Mouse-RefSeq-30, -32, -33, -34. Categories: Missing, Matching, Edge perfect, Identical.]

Chicken

Reconcile with Genome

Project orthologous proteins onto genome

Chicken

Human

Mouse

Chicken
• Extant dinosaur lineage
• Split from mammals 300 Mya
• Neutral rate of 1.5 substitutions per base
• No pseudogenes
• Good synteny to human
• Tested Ensembl Gene Build:
– 90% perfect exon boundary prediction
– 4% within 10 base pairs
– 85% sensitivity

Stickleback
• “close” to Fugu/Tetraodon
• 21,135 Genes
• 97% Gene Loci sensitivity (held out cDNAs)
• 87% exact exon prediction, 6% overlapping
• 63% of cDNAs had a perfect prediction without cDNA evidence
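The exon-level figures quoted for the chicken and stickleback gene builds (exact, within 10 base pairs, overlapping) can be illustrated with a small scoring routine. This is a minimal sketch, not the evaluation actually used by the Ensembl gene build: the function names, the `(start, end)` tuple representation and the rule that both boundaries must fall within the tolerance are all assumptions made for illustration.

```python
from collections import Counter

def classify_exon(ref, predicted, tolerance=10):
    """Classify one reference exon against a list of predicted exons.

    Exons are (start, end) tuples with inclusive coordinates on the same
    sequence and strand (an assumed representation).
    """
    rank = {"missed": 0, "overlapping": 1, "near": 2, "exact": 3}
    best = "missed"
    ref_start, ref_end = ref
    for pred_start, pred_end in predicted:
        # Ignore predictions that do not overlap the reference exon at all.
        if pred_start > ref_end or ref_start > pred_end:
            continue
        if (pred_start, pred_end) == (ref_start, ref_end):
            return "exact"
        # Assumption: "near" means both boundaries are within the tolerance.
        if (abs(pred_start - ref_start) <= tolerance
                and abs(pred_end - ref_end) <= tolerance):
            label = "near"
        else:
            label = "overlapping"
        if rank[label] > rank[best]:
            best = label
    return best

def exon_report(reference, predicted, tolerance=10):
    """Fraction of reference exons in each category."""
    counts = Counter(classify_exon(r, predicted, tolerance) for r in reference)
    total = len(reference)
    return {k: counts.get(k, 0) / total
            for k in ("exact", "near", "overlapping", "missed")}

# Toy coordinates, invented for this example.
reference = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (305, 400), (550, 650)]
report = exon_report(reference, predicted)
```

In practice the reference set would be the held-out cDNA alignments and the predictions would come from the gene build; here each toy reference exon lands in a different category.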

[Figure: species tree of the genomes in Ensembl, drawn against a "Million years" axis (ticks at 100, 200, 300, 400, 500 and 1000). Individual node ages are not recoverable from the extracted text.]

Species: Human; Mouse; Rat; Fugu (IMCB); Tetraodon (GENOSCOPE); Zebrafish; C. savignyi *; Fruitfly (FLYBASE); Malaria mosquito (VECTORBASE); C. elegans (WORMBASE); Medaka; Rhesus macaque; Chimpanzee; Dog; Cow; Chicken; Xenopus; C. intestinalis; Fever mosquito * (VECTORBASE); Honey bee; Yeast (SGD); Opossum; Stickleback; Armadillo *; Elephant *; Tenrec *; Rabbit *

Clades: Chordata; Vertebrata; Amniota; Tetrapoda; Mammalia; Eutheria; Metatheria; Teleostei; Urochordata; Arthropoda; Nematoda; Fungi; Aves; Amphibia

19 species currently in Ensembl
8 to be added by the end of the year
* already in pre-site

Example of the Insulin cluster and data flattening

Duplication node

Speciation node or leaf

one2one

one2one

one2many

many2many

apparent one2one
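The flattening of a gene tree into pairwise relations can be sketched as follows: two genes from different species are treated as orthologues when their last common ancestor is a speciation node, and the relation type follows from how many partners each gene has. This is a minimal illustration under those assumptions, not the Ensembl implementation. The `Node` class and function names are hypothetical, the toy tree is invented rather than taken from the slide, and the "apparent one2one" category is deliberately left out because its exact definition is not given here.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Node:
    kind: str                      # "speciation", "duplication" or "leaf"
    children: list = field(default_factory=list)
    gene: str = ""                 # set on leaves only
    species: str = ""              # set on leaves only

def leaves(node):
    """All leaf nodes below (or equal to) this node."""
    if node.kind == "leaf":
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def ortholog_pairs(node, sp_a, sp_b):
    """(gene_a, gene_b) pairs whose last common ancestor is a speciation node."""
    if node.kind == "leaf":
        return set()
    pairs = set()
    for child in node.children:
        pairs |= ortholog_pairs(child, sp_a, sp_b)
    if node.kind == "speciation":
        for i, left in enumerate(node.children):
            for right in node.children[i + 1:]:
                for x, y in product(leaves(left), leaves(right)):
                    if x.species == sp_a and y.species == sp_b:
                        pairs.add((x.gene, y.gene))
                    elif x.species == sp_b and y.species == sp_a:
                        pairs.add((y.gene, x.gene))
    return pairs

def classify(pairs):
    """Label each pair one2one, one2many or many2many by partner counts."""
    partners_a, partners_b = {}, {}
    for a, b in pairs:
        partners_a.setdefault(a, set()).add(b)
        partners_b.setdefault(b, set()).add(a)
    labels = {}
    for a, b in pairs:
        n_a, n_b = len(partners_a[a]), len(partners_b[b])
        if n_a == 1 and n_b == 1:
            labels[(a, b)] = "one2one"
        elif n_a == 1 or n_b == 1:
            labels[(a, b)] = "one2many"
        else:
            labels[(a, b)] = "many2many"
    return labels

# Toy tree: one human gene, and a duplication on the mouse lineage.
toy_tree = Node("speciation", [
    Node("leaf", gene="INS", species="human"),
    Node("duplication", [
        Node("leaf", gene="Ins1", species="mouse"),
        Node("leaf", gene="Ins2", species="mouse"),
    ]),
])
toy_labels = classify(ortholog_pairs(toy_tree, "human", "mouse"))
```

With this toy tree both human/mouse pairs come out as one2many, because the single human gene has two mouse partners created by a duplication after the speciation.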

Gene tree: 1st data assessment

Good concordance with the classical BRH/RHS paired-species approach (BRH: best reciprocal hits; RHS are based on gene order conservation)

Find more complex one-to-many and many-to-many relations

To do: compare with ~1000 curated trees from TreeFam

Human/Mouse

                     RHS      BRH      NEW
many2many            177      113    1,439
one2many             725    1,309    2,815
one2one              205   10,736      109
apparent one2one      78    1,571      104
lost               2,027    2,060

Human/Drosophila

                     BRH      NEW
many2many            170    1,599
one2many           1,870    4,563
one2one              880       80
apparent one2one   2,040      241
lost                 620

Other figures on the slide: 19,001 and 5,580 (under the Human/Mouse and Human/Drosophila labels respectively); 11,443; 19,381
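For comparison with the tree-based relations, the classical BRH baseline can be sketched in a few lines: a pair is kept only when each gene is the other's top-scoring hit. This is a minimal sketch with hypothetical function names and invented similarity scores; ties are resolved by whichever hit is seen first, which is an arbitrary choice made for this example.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target.

    scores: dict mapping (query, target) -> similarity score.
    On a tie, the first hit encountered is kept.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Invented scores for illustration only.
human_vs_mouse = {("INS", "Ins1"): 95, ("INS", "Ins2"): 99, ("IGF1", "Igf1"): 90}
mouse_vs_human = {("Ins1", "INS"): 95, ("Ins2", "INS"): 99, ("Igf1", "IGF1"): 90}
brh_pairs = best_reciprocal_hits(human_vs_mouse, mouse_vs_human)
```

In this toy case only one of the two duplicated mouse genes is paired with the human gene, which is the kind of one-to-many relation a pairwise best-hit approach cannot report.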

Example of AlignSliceView between Human/Mouse/Rat/Dog with MLAGAN

Transcript SNP View

Ensembl Outreach

How do you get it?
• www.ensembl.org
– Pretty pictures for genomes and genes
– Web based data mining
• Open MySQL server - ensembldb
– Script across the internet in Perl, Java or Python
– 100% consistent semantics between genomes
• Extend via DAS
– At genome, protein or “gene” levels
• Full download
– Extend in house, run in-house DAS servers
• Send someone to us (geek for a week)
• Bring over Xose to run a course (only travel costs need to be covered)
• Email helpdesk@ensembl.org for more info.
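Scripting against the open MySQL server from Python might look like the sketch below. It assumes the third-party PyMySQL package is installed, and that the public server is reachable at `ensembldb.ensembl.org` on port 3306 with the read-only user `anonymous` and no password; these connection details and the database name pattern are assumptions not stated on the slide and should be checked against current Ensembl documentation.

```python
def connection_settings(host="ensembldb.ensembl.org", port=3306, user="anonymous"):
    """Connection parameters for the public server (assumed values)."""
    return {"host": host, "port": port, "user": user, "password": ""}

def list_databases(pattern="homo_sapiens_core_%"):
    """Return the names of databases on the server matching a LIKE pattern.

    Requires network access and the PyMySQL package.
    """
    import pymysql  # third-party dependency, imported only when needed

    conn = pymysql.connect(**connection_settings())
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW DATABASES LIKE %s", (pattern,))
            return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
```

Calling `list_databases()` would list the human core databases if the assumed naming scheme holds; the same connection can then be pointed at one of them to run ordinary SQL queries.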

The ENCODE project

1% of the human genome

The Kitchen Sink of experimental methods

Protein coding loci are far more complex than we think

• On average 5 transcripts per locus

• Many do not encode proteins (as far as we can see)

• Even among the ones which do encode proteins, many of the proteins look “weird”

[Figure: The Clade B Serpins, potential missing fragments. Panels: (a) inactive, "stressed"; (b) active (beta inserted); (c); (d); (e); (f).]

Parsing the regulatory code

PolII

Myc

E2F2

H3K4Me3

Chromatin marks, Polymerase

In vivo Transcription Factors

[Figure: functional genomics data flow. Components: Nimblegen data; Tab2MAGE; MAGE-ML; Import API; FuncGen DB (Archive?); Analysis Pipeline; Processed Data; FuncGen DB (& Results?); FuncGen Results DB?; Export API; Web API?; DAS; Mirror; Local; Client; Browsers: Wiggle Plot, Histone Glyphs, Raw Data?]

Reactome

Pathways…

Insulin binds the insulin receptor, causing it to dimerise. The dimerised form then autophosphorylates on 6 cytoplasmic tyrosines. This phosphorylated form recruits the IRS adaptor....

Reactome data model

Insulin Peptide

Insulin Receptor

Reaction

Insulin Receptor dimer complex

Insulin Receptor dimer complex, P-Tyr on 67...

Reaction

GO:phosphorylation

Catalyst Activity

PubMed:1543

PubMed:5623

Insulin Receptor Signalling

x2
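The data model on this slide (entities, reactions that turn inputs into outputs, a catalyst activity expressed as a GO term, literature references, and a pathway grouping the reactions) can be sketched with a few dataclasses. The class and field names below are hypothetical and are not Reactome's actual schema. Which PubMed reference belongs to which reaction is a guess made for illustration, and the "x2" stoichiometry is not modelled.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str

@dataclass
class Reaction:
    name: str
    inputs: list
    outputs: list
    catalyst_activity: str = ""                    # e.g. a GO term label
    literature: list = field(default_factory=list)

@dataclass
class Pathway:
    name: str
    reactions: list

    def is_chained(self):
        """True if each reaction consumes an output of the one before it."""
        return all(
            set(prev.outputs) & set(nxt.inputs)
            for prev, nxt in zip(self.reactions, self.reactions[1:])
        )

insulin = Entity("Insulin Peptide")
receptor = Entity("Insulin Receptor")
dimer = Entity("Insulin Receptor dimer complex")
phospho_dimer = Entity("Insulin Receptor dimer complex, P-Tyr")

# Assignment of references to reactions is illustrative only.
binding = Reaction(
    "Insulin binds the receptor, which dimerises",
    inputs=[insulin, receptor],
    outputs=[dimer],
    literature=["PubMed:1543"],
)
autophosphorylation = Reaction(
    "Receptor dimer autophosphorylates",
    inputs=[dimer],
    outputs=[phospho_dimer],
    catalyst_activity="GO:phosphorylation",
    literature=["PubMed:5623"],
)
pathway = Pathway("Insulin Receptor Signalling", [binding, autophosphorylation])
```

Representing the output of one reaction as the input of the next is what lets a pathway be read as a connected chain of events rather than a list of unrelated facts.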

Lineage Deletion rates

Trp Catabolism

Head or Tail

DNA Repair

Redundant Paths

Insulin Signalling

Pathway modules

Back to Proteins

Proteins are the natural Hub

Variation

Pathways

Regulation

Structures

Literature

Genome

Proteins

Thanks: Ensembl

Leaders

Ewan Birney (EBI), Tim Hubbard (Sanger Institute)

Analysis and Annotation Pipeline

Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe, Felix Kokocinski, Simon White

Database Schema and Core API

Glenn Proctor, Ian Longden, Craig Melsopp, Patrick Meidl

BioMart

Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley

Distributed Annotation System (DAS)

Andreas Kähäri, Eugene Kulesha

Outreach

Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich

Web Team

James Smith, Fiona Cunningham, Anne Parker, Stephen Rice, Steve Trevanion, Matt Wood

Comparative Genomics

Abel Ureta-Vidal, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Javier Herrero, Albert Vilella

Functional Genomics + Variation

Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios

Zebrafish Annotation

Kerstin Jekosch, Mario Caccamo

Systems & Support

Guy Coates, Tim Cutts

Thanks: Reactome and Consortia

Reactome (EBI and CSHL)

Reactome @ EBI

Ewan Birney, Imre Vastrik, Esther Schmidt, Bernard de Bono, Bijay Jassal

Reactome @ CSHL

Lincoln Stein, Peter D’Eustachio, Gopal Gopinathrao, Guanming Wu, Lisa Matthews, Marc Gillespie

ENCODE (40 groups worldwide)

Leaders:

Zhiping Weng (BU), Mike Snyder (Yale), John Stam. (U. Wash), Roderic Guigo (Barcelona), Tom Gingeras (Affy), Elliott Margulies (NIH), Anindya Dutta (Duke), Manolis Dermitzakis (Sanger)

BioSapiens (20 groups across Europe)

Structural work of ENCODE

Alfonso Valencia (Madrid), Michael Tress (Madrid), Janet Thornton (EBI), Gabby Logan (EBI)