The Havana-Gencode annotation

GENCODE CONSORTIUM

Page 2: The Havana-Gencode annotation GENCODE CONSORTIUM

Loci annotated in the 44 ENCODE regions

The columns from Known to TEC give the number of loci of each locus type.

Region  Type    Length (kb)  Chr  Known  Novel protein  Novel transcript  Putative  Pseudogene (processed)  Pseudogene (unprocessed)  Artefact  TEC
ENr114  Random  500          10   1      0              1                 0         0                       0                         0         0
ENr231  Random  500          1    12     0              0                 2         1                       0                         0         0
ENr232  Random  500          9    9      0              3                 4         1                       0                         1         0
ENr324  Random  500          X    3      0              2                 1         4                       0                         0         0
ENr334  Random  500          6    8      0              1                 1         1                       0                         0         2
ENr323  Random  500          6    5      0              0                 2         5                       0                         0         0
ENr111  Random  500          13   1      1              3                 3         2                       0                         0         2
ENr222  Random  500          6    3      0              1                 2         1                       0                         0         0
ENr132  Random  500          13   4      0              1                 3         1                       0                         0         0
ENr333  Random  500          20   16     0              1                 4         3                       1                         0         0
ENm004  Manual  1700         22   19     0              3                 5         9                       4                         1         0
ENm006  Manual  1338.447     X    42     5              1                 1         5                       4                         1         2
ENr223  Random  500          6    6      3              2                 2         7                       0                         0         0
ENm002  Manual  1000         5    21     1              4                 7         0                       2                         2         0
ENr112  Random  500          2    0      0              0                 0         1                       1                         0         0
ENr113  Random  500          4    0      0              2                 0         2                       0                         0         0
ENr121  Random  500          2    2      0              1                 3         0                       1                         0         1
ENr131  Random  500          2    13     1              3                 1         4                       4                         0         1
ENr212  Random  500          5    2      0              2                 0         0                       0                         0         0
ENr221  Random  500          5    3      1              3                 0         2                       0                         0         1
ENr313  Random  500          16   0      0              0                 0         0                       0                         0         0
ENr321  Random  500          8    2      0              1                 2         0                       0                         0         1
ENr331  Random  500          2    5      5              1                 5         2                       0                         0         0
ENr122  Random  500          18   9      1              0                 0         1                       0                         1         0
ENr233  Random  500          15   15     0              2                 0         1                       4                         0         0
ENr311  Random  500          14   0      0              0                 0         1                       0                         0         0
ENr123  Random  500          12   3      0              0                 1         1                       0                         0         0
ENr322  Random  500          14   2      0              0                 0         2                       0                         0         0
ENr213  Random  500          18   1      0              0                 1         0                       0                         0         0
ENm001  Manual  1877.426     7    12     1              3                 5         6                       0                         0         0
ENm003  Manual  500          11   5      2              2                 1         1                       0                         0         0
ENm005  Manual  1695.985     21   23     0              10                2         6                       0                         1         2
ENm007  Manual  1000.876     19   39     0              6                 3         6                       14                        0         1
ENm008  Manual  500          16   21     2              3                 1         2                       3                         0         0
ENm009  Manual  1001.592     11   44     1              2                 1         3                       28                        0         0
ENm010  Manual  500          7    15     0              4                 3         6                       0                         1         2
ENm011  Manual  606.048      11   15     1              2                 5         2                       0                         4         0
ENm012  Manual  1000         7    2      0              1                 0         2                       0                         0         0
ENm013  Manual  1114.424     7    7      0              4                 1         3                       0                         0         1
ENm014  Manual  1163.197     7    5      0              3                 0         3                       0                         0         2
ENr133  Random  500          21   6      1              2                 2         4                       0                         1         0
ENr211  Random  500          16   1      0              1                 0         2                       0                         0         0
ENr312  Random  500          11   1      0              1                 2         0                       0                         1         0
ENr332  Random  500          11   13     0              0                 2         1                       0                         1         0

TOTAL           29997.995         416    26             82                78        104                     66                        15        18
Total, first 13 8538.447          129    9              19                30        40                      9                         3         6
Total, other 31 21459.548         287    17             63                48        64                      57                        12        12

Page 3: The Havana-Gencode annotation GENCODE CONSORTIUM

Experimental validations of the manual annotations

The annotations produced by the Havana team at Sanger are being verified experimentally through RT-PCRs and RACEs (University of Geneva).

Loci by type: known 416, novel protein 26, novel transcript 82, putative 78, TEC 18

Validation experiments:

- 5' RACEs to obtain full-length mRNA(s)
- RT-PCRs to check 360 junctions
- Bidirectional RACEs to obtain full-length mRNAs
- Experimental validation of the annotated single exons

Workflow: initial annotation -> experimental validations -> updated annotation -> new set of confirmed genes

Page 4: The Havana-Gencode annotation GENCODE CONSORTIUM

5' RACEs to extend Known and Novel protein genes

- 214 / 426 loci provided positive RACEs for at least one primer (50%)

- About 10% of the successful RACEs extend the loci at the 5' end (and some provide new exon junctions)

(some RACE products are still being analysed)

Experimental validations of the manual annotations

Page 5: The Havana-Gencode annotation GENCODE CONSORTIUM

RT-PCRs: VEGA Novel_transcript and Putative

                       Tested  Validated  % validated
Junction level         360     73         20.3%
Transcript level       214     59         27.6%
Locus level            161     48         29.8%
Putative loci          78      11         14.1%
Novel transcript loci  81      36         44.4%

The Novel transcript loci have a higher success rate than the Putative loci (in accordance with their definition).

When more than one junction was submitted for the same transcript, all the junctions agreed in 2/3 of the cases (mostly all junctions negative).
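As a quick arithmetic check, the validation percentages can be recomputed from the tested and validated counts reported on this slide. The sketch below is only illustrative; the dictionary and function names are hypothetical and not part of the original analysis.

```python
# Tested and validated counts as reported in the RT-PCR results on this slide.
rt_pcr_results = {
    "Junction level": (360, 73),
    "Transcript level": (214, 59),
    "Locus level": (161, 48),
    "Putative loci": (78, 11),
    "Novel transcript loci": (81, 36),
}


def validation_rate(tested: int, validated: int) -> float:
    """Return the percentage of tested items that were validated."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    return 100.0 * validated / tested


# Round to one decimal place, as on the slide.
rates = {
    name: round(validation_rate(tested, validated), 1)
    for name, (tested, validated) in rt_pcr_results.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Run as is, this should reproduce the percentages shown in the table.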

Experimental validations of the manual annotations

Page 6: The Havana-Gencode annotation GENCODE CONSORTIUM

RT-PCRs on non-canonical splice sites

43 non-canonical splice sites (neither GT-AG nor GC-AG) were detected in the 13 training ENCODE regions.

32 could be tested by RT-PCR (for the others, the exons were too short for primer picking):

- 1 was confirmed: it is actually a canonical U12 intron (AT-AC)
- 6 provided canonical junctions (already existing in other annotated splice forms)
- 25 were negative

=> None of the non-canonical splice sites could be validated experimentally

(83 other splice sites are being checked in the 31 other regions)
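The splice-site categories used on this slide can be illustrated with a minimal sketch that looks only at the first and last two bases of an intron. The function name and the returned labels are hypothetical, and the intron sequence is assumed to be given on the transcribed strand. Terminal dinucleotides alone do not establish that an intron is U12-type, so AT-AC is reported only as a candidate.

```python
def classify_intron(intron_seq: str) -> str:
    """Classify an intron by its terminal dinucleotides (donor-acceptor).

    Assumption: intron_seq is the full intron sequence on the transcribed
    strand, so its first two bases are the donor site and its last two
    bases are the acceptor site.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence is too short to have both sites")
    pair = f"{seq[:2]}-{seq[-2:]}"
    if pair in ("GT-AG", "GC-AG"):
        # The two classes treated as canonical on the slide.
        return "canonical"
    if pair == "AT-AC":
        # Termini typical of some U12-type introns; not proof of U12 type.
        return "candidate U12 (AT-AC)"
    return "non-canonical"


print(classify_intron("GTAAGTTTTCAG"))
print(classify_intron("ATATCCTTTTAC"))
print(classify_intron("CTAAGTTTTCAC"))
```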

Experimental validations of the manual annotations

Page 7: The Havana-Gencode annotation GENCODE CONSORTIUM

Gene predictions outside of Havana-Gencode annotations

In 13 ENCODE regions, 1255 predicted introns (by one or more of the 9 methods) are not annotated in VEGA:

- 380 (30%) extend VEGA objects (1)

- 530 (42%) are in introns of VEGA objects (2)

- 11 (1%) link exons from distinct VEGA objects (3)

- 334 (27%) are completely outside of VEGA annotations (4)

[Diagram: the Havana-Gencode annotation aligned with predictions, illustrating cases (1) to (4)]

The 9 methods:

- 6 computational gene prediction programs (geneid, genscan, SGP, twinscan, fgenesh, exonify)

- 3 EST-based methods (acembly, Ecgene, Ensembl EST)
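The four categories above could be assigned automatically along the following lines. This is a rough sketch under stated assumptions, not the procedure used for the slides: annotated objects are represented as lists of exon intervals (1-based, inclusive), a predicted intron "touches" an object when it overlaps or directly abuts the object's genomic span, and all function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def touches(intron: Interval, span: Interval) -> bool:
    """True if the intron overlaps the span or directly abuts it."""
    return intron[0] <= span[1] + 1 and span[0] - 1 <= intron[1]


def annotated_introns(exons: List[Interval]) -> List[Interval]:
    """Gaps between consecutive exons of one annotated object."""
    ordered = sorted(exons)
    return [(a[1] + 1, b[0] - 1) for a, b in zip(ordered, ordered[1:]) if b[0] - a[1] > 1]


def classify_predicted_intron(intron: Interval, objects: List[List[Interval]]) -> str:
    """Assign a predicted intron to one of the four cases (assumed rules)."""
    spans = [(min(s for s, _ in ex), max(e for _, e in ex)) for ex in objects]
    hits = [i for i, span in enumerate(spans) if touches(intron, span)]
    if not hits:
        return "outside"    # case (4): no contact with any annotated object
    if len(hits) > 1:
        return "linking"    # case (3): connects distinct annotated objects
    gaps = annotated_introns(objects[hits[0]])
    if any(g[0] <= intron[0] and intron[1] <= g[1] for g in gaps):
        return "intronic"   # case (2): lies inside an annotated intron
    return "extending"      # case (1): everything else touching one object


# Two hypothetical annotated objects, given as lists of exons.
objects = [[(100, 200), (300, 400)], [(1000, 1100)]]
for intron in [(401, 600), (220, 280), (401, 999), (5000, 6000)]:
    print(intron, classify_predicted_intron(intron, objects))
```

The "extending" case is deliberately a catch-all here; a real pipeline would also need strand information and would treat introns that overlap annotated exons separately.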

Page 8: The Havana-Gencode annotation GENCODE CONSORTIUM

Gene predictions outside of Havana-Gencode annotations: RT-PCRs on exon junctions

1255 predicted introns tested:

=> 16 RT-PCRs confirmed the predicted junction, 9 provided another junction (excluding pseudogenes)

=> Only 3 are intergenic (new loci?) --> being extended by RACE

ENCODE region  Intron length  Intron type  Confirmed by RT-PCR*  Predictor  Identifier
ENr232         685            intronic     1                     Twinscan  chr9.128.008.a-intron-1
ENr333         327            intronic     1                     Ecgene:acembly  H20C3836.184.ENr333.-1-intron-2:H20C3836.185.ENr333.-1-intron-2:H20C3836.186.ENr333.-1-intron-2:H20C3836.187.ENr333.-1-intron-2:H20C3836.188.ENr333.-1-intron-2:H20C3836.189.ENr333.-1-intron-2:H20C3836.190.ENr333.-1-intron-2:H20C3836.191.ENr333.-1-intron-2
ENr334         32543          intronic     1                     Geneid  chr6_676.1-intron-1
ENr323         8327           intronic     1                     Genscan  NT_025741.185.ENr323.-1-intron-7
ENr231         118            intronic     1                     Acembly  PIK4CB.pDec03-intron-1
ENr231         425            extending    1                     Acembly  TUFT1.cDec03-intron-2
ENr334         183            intergenic   1                     Ecgene  H6C6026.1-intron-1
ENr333         148            intronic     2                     Ecgene:fgenesh:geneid:genscan:twinscan  C20000553.ENr333.-1-intron-3:H20C3776.1.ENr333.-1-intron-3:NT_028392.98.ENr333.-1-intron-3:chr20.35.017.a.ENr333.-1-intron-3:chr20_378.1.ENr333.-1-intron-3
ENr333         148            intronic     2                     Ecgene:fgenesh:geneid:genscan:twinscan  C20000553.ENr333.-1-intron-2:H20C3776.1.ENr333.-1-intron-2:NT_028392.98.ENr333.-1-intron-2:chr20.35.017.a.ENr333.-1-intron-2:chr20_378.1.ENr333.-1-intron-2
ENr334         307            extending    2                     Ecgene  H6C6029.1-intron-1:H6C6029.2-intron-1
ENr333         625            intronic     2                     Ecgene  H20C3671.96.ENr333.-1-intron-1
ENr223         38             intronic     1                     genscan_outof_VEGA  NT_007299.150.ENr223.-1-intron-5
ENm004         2247           intronic     1                     ECgene_outof_VEGA:acembly_outof_VEGA  H22C3076.70.ENm004.-1-intron-2:PISD.uDec03.ENm004.-1-intron-2
ENm004         2156           external     1                     ECgene_outof_VEGA:acembly_outof_VEGA  H22C3172.3.ENm004.-1-intron-1:RFPL2.bDec03.ENm004.-1-intron-1
ENm004         1807           external     1                     geneid_outof_VEGA  chr22_380.1.ENm004.+1-intron-2
ENm006         1226           intronic     1                     ECgene_outof_VEGA  HXC8372.3.ENm006.+1-intron-1
ENm006         27179          external     1                     ECgene_outof_VEGA  HXC8602.5.ENm006.-1-intron-1
ENm006         522            external     1                     genscan_outof_VEGA  NT_025307.19.ENm006.+1-intron-3
ENr223         6981           external     1                     acembly_outof_VEGA  EIF3S6P1.aDec03.ENr223.+1-intron-1
ENr223         35640          external     1                     ECgene_outof_VEGA:acembly_outof_VEGA  EIF3S6P1.fDec03.ENr223.+1-intron-1:H6C8411.2.ENr223.+1-intron-1
ENr223         4040           intronic     2                     ECgene_outof_VEGA:acembly_outof_VEGA:ensEST_outof_VEGA  ENSESTT00000043495.0.ENr223.+1-intron-1:ENSESTT00000043496.0.ENr223.+1-intron-1:H6C8448.68.ENr223.+1-intron-1:H6C8448.69.ENr223.+1-intron-1:H6C8448.70.ENr223.+1-intron-1:H6C8448.71.ENr223.+1-intron-1:H6C8448.72.ENr223.+1-intron-1:H6C8448.73.ENr223.+1-intron-1:H6C8448.74.ENr223.+1-intron-1:H6C8448.75.ENr223.+1-intron-1:H6C8448.76.ENr223.+1-intron-1:H6C8448.77.ENr223.+1-intron-1:H6C8448.78.ENr223.+1-intron-1:H6C8448.79.ENr223.+1-intron-1:H6C8448.80.ENr223.+1-intron-1:H6C8448.81.ENr223.+1-intron-1:H6C8448.82.ENr223.+1-intron-1:H6C8448.83.ENr223.+1-intron-1:H6C8448.84.ENr223.+1-intron-1:H6C8448.85.ENr223.+1-intron-1:H6C8448.86.ENr223.+1-intron-1:H6C8448.87.ENr223.+1-intron-1:MTO1.jDec03.ENr223.+1-intron-1
ENm004         414            intronic     2                     SGP_outof_VEGA  chr22_376.1.ENm004.-1-intron-1
ENr223         33810          intronic     2                     geneid_outof_VEGA  chr6_921.1.ENr223.+1-intron-3
ENm004         1764           intergenic   2                     genscan_outof_VEGA  NT_011520.388.ENm004.+1-intron-7
ENm004         3534           intergenic   2                     acembly_outof_VEGA  sparter.ENm004.+1-intron-1

* 1: RT-PCR successful; 2: RT-PCR provided a product with a wrong exon junction

Page 9: The Havana-Gencode annotation GENCODE CONSORTIUM

Gene predictions outside of Havana-Gencode annotations: the last 31 regions

- About 3500 introns predicted by standard programs from UCSC tracks are outside of the Havana-Gencode annotation (about 900 intergenic). Very few of those could correspond to real positives (=> need to prioritize).

- Additionally, the EGASP predictions add about 7000 other new introns (about 1000 intergenic).

Page 10: The Havana-Gencode annotation GENCODE CONSORTIUM

Description of the annotations: gene density

[Bar chart: number of loci per Mb (axis 0 to 60) for each of the 44 ENCODE regions, ENr111 to ENr334 and ENm001 to ENm014]

Page 11: The Havana-Gencode annotation GENCODE CONSORTIUM

Description of the annotations: alternative splicing

[Chart "Alternative splicing": percentage of loci (0% to 100%) that are alternatively spliced, not alternatively spliced with several exons, or not alternatively spliced single-exon genes, for "known and novel CDS" versus "novel transcript and putative"]

[Chart "Nbr of transcripts per locus" (axis 0 to 6): "known and novel CDS" versus "novel transcript and putative"]

[Chart "Nbr of exons per transcript" (axis 0 to 8): "known and novel CDS" versus "novel transcript and putative"]

Avg: 4.2 transcripts per locus, 6.7 exons per transcript

Page 12: The Havana-Gencode annotation GENCODE CONSORTIUM

Description of the annotations: coding loci

424 coding loci in the 44 ENCODE regions; on average, 44.6% of the transcripts are annotated as coding

Page 13: The Havana-Gencode annotation GENCODE CONSORTIUM

Description of the annotations: lengths of exons, introns, CDS, UTRs…

Mean exon length (all transcripts)      235.55 bp
Mean intron length (all transcripts)    4611.38 bp
Mean exon length (coding transcripts)   238.16 bp
Mean CDS exon length                    143.66 bp
Mean CDS-UTR exon length                667.20 bp
Mean UTR exon length                    238.75 bp
Mean locus length                       33210.48 bp
Mean intergenic length                  64808.53142 bp
Mean % NT covered                       32.57%

[Chart "Coding transcripts": mean exon length (all types), mean CDS exon length, mean CDS-UTR exon length and mean UTR exon length, in bp (axis 0 to 800)]

Page 14: The Havana-Gencode annotation GENCODE CONSORTIUM

Comparison between Havana-Gencode annotation and other sets: ENSEMBL, REFSEQ, MGC, CCDS

                Nbr of genes  Nbr of transcripts  Nbr of exons
Havana-Gencode  579           2632                2632
REFSEQ          386           539                 539
ENSEMBL         456           747                 747
MGC             266           424                 424
CCDS            280           334                 334

Page 15: The Havana-Gencode annotation GENCODE CONSORTIUM

Gene level

[Chart "Locus": number of loci (axis 0 to 700), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.66, Havana/Ensembl SN=0.73, Havana/MGC SN=0.45, Havana/CCDS SN=0.49
REFSEQ/Havana SP=1, ENSEMBL/Havana SP=0.92, MGC/Havana SP=0.98, CCDS/Havana SP=1

[Chart "Locus, only cds": number of loci (axis 0 to 500), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.89, Havana/Ensembl SN=0.95, Havana/MGC SN=0.61, Havana/CCDS SN=0.65
REFSEQ/Havana SN=0.99, ENSEMBL/Havana SN=0.87, MGC/Havana SN=0.98, CCDS/Havana SN=1

=> Most of the genes from the other sets are contained in Havana-Gencode annotation (less for ENSEMBL)
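The slides do not define SN and SP explicitly. One reading, assumed here, is that "Havana/X SN" is the fraction of Havana-Gencode features also found in set X, and "X/Havana SP" is the fraction of the features of set X also found in Havana-Gencode. With 579 Havana-Gencode genes and 386 REFSEQ genes all contained in Havana-Gencode (SP=1), this reading gives 386/579 ≈ 0.67, close to the reported 0.66. The sketch below follows that assumed reading; the function name and the toy gene identifiers are hypothetical.

```python
from typing import Set, Tuple


def sn_sp(havana: Set[str], other: Set[str]) -> Tuple[float, float]:
    """Return (SN, SP) for two sets of comparable feature identifiers.

    Assumed definitions (not stated on the slides):
      SN = features common to both sets / features in Havana-Gencode
      SP = features common to both sets / features in the other set
    """
    if not havana or not other:
        raise ValueError("both sets must be non-empty")
    common = havana & other
    return len(common) / len(havana), len(common) / len(other)


# Toy example with made-up identifiers: 4 Havana loci, 3 loci in the other set.
havana_loci = {"locusA", "locusB", "locusC", "locusD"}
other_loci = {"locusA", "locusB", "locusX"}
sn, sp = sn_sp(havana_loci, other_loci)
print(f"SN={sn:.2f} SP={sp:.2f}")
```

In practice the hard part is deciding when two features from different sets count as the same one (same locus, identical transcript, identical exon), which is exactly what changes between the comparison levels shown on the following slides.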

Page 16: The Havana-Gencode annotation GENCODE CONSORTIUM

Transcript level

[Chart "Transcripts": number of transcripts (axis 0 to 3000), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.026, Havana/Ensembl SN=0.027, Havana/MGC SN=0.006, Havana/CCDS SN=0.001
REFSEQ/Havana SP=0.126, ENSEMBL/Havana SP=0.096, MGC/Havana SP=0.038, CCDS/Havana SP=0.009

[Chart "Transcripts, only CDS": number of transcripts (axis 0 to 1200), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.423, Havana/Ensembl SN=0.418, Havana/MGC SN=0.283, Havana/CCDS SN=0.325
REFSEQ/Havana SP=0.712, ENSEMBL/Havana SP=0.511, MGC/Havana SP=0.568, CCDS/Havana SP=0.868

=> Very few full transcripts are exactly identical

The coding part of the transcripts is better conserved

Page 17: The Havana-Gencode annotation GENCODE CONSORTIUM

Relaxed criterion: allows transcripts from the other sets to be included in Havana-Gencode transcripts

[Diagram: a Havana-Gencode transcript aligned with transcripts from other sets, one supporting the annotated transcript and one not supporting it]

=> Few transcripts are exactly identical, but most of the transcripts from the other sets are included in transcripts from Havana-Gencode, especially MGC genes (transcripts not as extended as the annotation)
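The slides do not spell out how inclusion is tested. One plausible reading, assumed here, is that a transcript from another set is "included" when its chain of introns is a contiguous part of the annotated transcript's intron chain and its two terminal exons do not extend beyond the corresponding annotated exons. The sketch below implements that assumed rule; exon coordinates are 1-based inclusive and all names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Intron chain implied by a transcript's exons."""
    ordered = sorted(exons)
    return [(a[1] + 1, b[0] - 1) for a, b in zip(ordered, ordered[1:])]


def is_identical(other: List[Exon], annotated: List[Exon]) -> bool:
    """Strict criterion: every exon boundary is the same."""
    return sorted(other) == sorted(annotated)


def is_included(other: List[Exon], annotated: List[Exon]) -> bool:
    """Relaxed criterion (assumed rule): other fits inside annotated."""
    other, annotated = sorted(other), sorted(annotated)
    if len(other) == 1:
        # A single-exon transcript must lie within one annotated exon.
        start, end = other[0]
        return any(a_start <= start and end <= a_end for a_start, a_end in annotated)
    o_int, a_int = introns(other), introns(annotated)
    n = len(o_int)
    for i in range(len(a_int) - n + 1):
        if a_int[i:i + n] == o_int:
            # Matching introns span annotated exons i .. i + n.
            first, last = annotated[i], annotated[i + n]
            return first[0] <= other[0][0] and other[-1][1] <= last[1]
    return False


annotated = [(100, 200), (300, 400), (500, 650)]
shorter = [(150, 200), (300, 400), (500, 600)]  # same introns, shorter ends
print(is_identical(shorter, annotated), is_included(shorter, annotated))
```

Under this rule a transcript with shorter UTRs but the same splice junctions fails the strict test and passes the relaxed one, which matches the situation described for the MGC transcripts.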

Page 18: The Havana-Gencode annotation GENCODE CONSORTIUM

Transcript level: relaxed criterion

[Chart "Transcripts": number of transcripts (axis 0 to 3000), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.026, Havana/Ensembl SN=0.027, Havana/MGC SN=0.006, Havana/CCDS SN=0.001
REFSEQ/Havana SP=0.126, ENSEMBL/Havana SP=0.096, MGC/Havana SP=0.038, CCDS/Havana SP=0.009

=> [Chart "Transcripts, relaxed criterion": number of transcripts (axis 0 to 3000), same split]
Havana/Refseq SN=0.122, Havana/Ensembl SN=0.139, Havana/MGC SN=0.111, Havana/CCDS SN=0.139
REFSEQ/Havana SP=0.594, ENSEMBL/Havana SP=0.491, MGC/Havana SP=0.887, CCDS/Havana SP=0.491

[Chart "Transcripts, only CDS": number of transcripts (axis 0 to 1200), same split]
Havana/Refseq SN=0.423, Havana/Ensembl SN=0.418, Havana/MGC SN=0.283, Havana/CCDS SN=0.325
REFSEQ/Havana SP=0.712, ENSEMBL/Havana SP=0.511, MGC/Havana SP=0.568, CCDS/Havana SP=0.868

=> [Chart "Transcripts, only CDS, relaxed criterion": number of transcripts (axis 0 to 3000), same split]
Havana/Refseq SN=0.210, Havana/Ensembl SN=0.223, Havana/MGC SN=0.139, Havana/CCDS SN=0.152
REFSEQ/Havana SP=0.904, ENSEMBL/Havana SP=0.704, MGC/Havana SP=0.940, CCDS/Havana SP=0.961

Page 19: The Havana-Gencode annotation GENCODE CONSORTIUM

Exon/intron level

[Chart "all exons": number of exons (axis 0 to 10000), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.37, Havana/Ensembl SN=0.4, Havana/MGC SN=0.2, Havana/CCDS SN=0.23
REFSEQ/Havana SP=0.85, ENSEMBL/Havana SP=0.75, MGC/Havana SP=0.76, CCDS/Havana SP=0.79

[Chart "all introns": number of introns (axis 0 to 7000), same split]
Havana/Refseq SN=0.58, Havana/Ensembl SN=0.64, Havana/MGC SN=0.33, Havana/CCDS SN=0.39
REFSEQ/Havana SP=0.98, ENSEMBL/Havana SP=0.9, MGC/Havana SP=0.98, CCDS/Havana SP=1

[Chart "CDS exons": number of exons (axis 0 to 5000), same split]
Havana/Refseq SN=0.71, Havana/Ensembl SN=0.75, Havana/MGC SN=0.42, Havana/CCDS SN=0.52
REFSEQ/Havana SP=0.96, ENSEMBL/Havana SP=0.85, MGC/Havana SP=0.95, CCDS/Havana SP=0.98

[Chart "CDS introns": number of introns (axis 0 to 4500), same split]
Havana/Refseq SN=0.75, Havana/Ensembl SN=0.8, Havana/MGC SN=0.43, Havana/CCDS SN=0.55
REFSEQ/Havana SP=0.98, ENSEMBL/Havana SP=0.89, MGC/Havana SP=0.98, CCDS/Havana SP=0.99

=> More introns than exons are common to both sets: this could be explained by the fact that most differences are in UTRs (last exons)

Page 20: The Havana-Gencode annotation GENCODE CONSORTIUM

Nucleotide level

[Chart "nucleotide": number of nucleotides (axis 0 to 2000000), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.53, Havana/Ensembl SN=0.57, Havana/MGC SN=0.27, Havana/CCDS SN=0.23
REFSEQ/Havana SP=0.98, ENSEMBL/Havana SP=0.95, MGC/Havana SP=0.99, CCDS/Havana SP=1

[Chart "nucleotide, CDS only": number of nucleotides (axis 0 to 800000), same split]
Havana/Refseq SN=0.82, Havana/Ensembl SN=0.88, Havana/MGC SN=0.43, Havana/CCDS SN=0.58
REFSEQ/Havana SP=0.99, ENSEMBL/Havana SP=0.92, MGC/Havana SP=0.98, CCDS/Havana SP=1

Conclusions

- The Havana-Gencode annotation is richer than the other data sets.

- REFSEQ, MGC and CCDS are almost completely contained in Havana-Gencode, especially CCDS (smaller set).

- ENSEMBL contains more “false positives” (bigger set).

- Transcripts from the other sets are less extended than transcripts from the Havana-Gencode annotation, especially MGC (very few transcripts are completely identical).

Page 21: The Havana-Gencode annotation GENCODE CONSORTIUM
Page 22: The Havana-Gencode annotation GENCODE CONSORTIUM

Exon pair level (exon-intron-exon)

[Chart "exon pairs": number of exon pairs (axis 0 to 16000), split into "common to both sets", "only in Havana" and "only in other set"]
Havana/Refseq SN=0.568, Havana/Ensembl SN=0.591, Havana/MGC SN=0.332, Havana/CCDS SN=0.366
REFSEQ/Havana SP=0.845, ENSEMBL/Havana SP=0.773, MGC/Havana SP=0.778, CCDS/Havana SP=0.784

[Chart "exon pairs, relaxed criterion": number of exon pairs (axis 0 to 16000), same split]
Havana/Refseq SN=0.666, Havana/Ensembl SN=0.702, Havana/MGC SN=0.414, Havana/CCDS SN=0.491
REFSEQ/Havana SP=0.960, ENSEMBL/Havana SP=0.888, MGC/Havana SP=0.975, CCDS/Havana SP=0.999

[Chart "exon pairs, only CDS": number of exon pairs (axis 0 to 9000), same split]
Havana/Refseq SN=0.805, Havana/Ensembl SN=0.825, Havana/MGC SN=0.474, Havana/CCDS SN=0.578
REFSEQ/Havana SP=0.961, ENSEMBL/Havana SP=0.862, MGC/Havana SP=0.956, CCDS/Havana SP=0.986

[Chart "exon pairs, only CDS, relaxed criterion": number of exon pairs (axis 0 to 16000), same split]
Havana/Refseq SN=0.675, Havana/Ensembl SN=0.699, Havana/MGC SN=0.417, Havana/CCDS SN=0.699
REFSEQ/Havana SP=0.987, ENSEMBL/Havana SP=0.906, MGC/Havana SP=0.982, CCDS/Havana SP=0.906