GeneGames.org: Crowdsourcing human gene annotation (Genome Informatics 2012)

Talk given at the Genome Informatics conference 2012 at Robinson College, Cambridge University.

The Gene Wiki: Crowdsourcing human gene annotation

Andrew Su, Ph.D.

@andrewsu

asu@scripps.edu

http://sulab.org

GeneGames.org

Genome Informatics

September 6, 2012

The Gene Wiki crib sheet

• Bulk creation of ~10k Wikipedia articles (http://dx.doi.org/10.1371/journal.pbio.0060175)

• Monthly stats: > 4 million views, > 1000 edits (http://dx.doi.org/10.1093/nar/gkr925)

• Text mining reveals novel Gene Ontology and Disease Ontology annotations (http://dx.doi.org/doi:10.1186/1471-2164-12-603)

• Mash-up with SNPedia for crowdsourced gene-disease database (http://www.jbiomedsem.com/content/3/S1/S6)

• Merging Wikipedia with the Semantic Web (http://dx.doi.org/10.1093/database/bar060)

http://www.slideshare.net/andrewsu

http://www.flickr.com/photos/archana3k1/4124330493/

Seven million human hours

Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Using games to fold proteins

Fold.it players have successfully:

• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)

• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)

• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)

• Improved enzyme activity of de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

http://fold.it

Using games to fold RNAs

http://eterna.cmu.edu/

Using games to align sequences

http://phylo.cs.mcgill.ca

Using games to annotate genes?

http://genegames.org

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …

477 diseases!

Play Dizeez to annotate gene-disease links

1. Read the clue (gene)

2. Click the related disease (only one is “right”)

3. If it’s ‘right’, you get points

4. Then on to the next question…

5. Hurry!

6. Play to win!

Dizeez players seem pretty smart…

In total (since Dec 2011):

• 207 unique gamers

• 1045 games played

• 8525 guesses

# Occurrences   Gene    Disease
7               GAST    gastrinoma
7               RBP3    retinoblastoma
7               SSX1    synovial sarcoma
6               TG      Graves' disease
6               CRYGC   Cataract
6               SOX8    mental retardation
6               WRN     Werner syndrome
6               ABL1    leukemia
6               MLL3    leukemia
6               SNAI2   breast carcinoma

PubMed, OMIM, PharmGKB, Gene Wiki

Dizeez players seem pretty smart…

# Occurrences   Gene    Disease
5               MECOM   sarcoma
4               ATF7    cancer
3               ABCB5   acute myeloid leukemia
3               SART1   glioblastoma
3               NCK1    leukemia
3               NEK1    cancer

PubMed, OMIM, PharmGKB, Gene Wiki
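The occurrence tables above count how often players linked a gene to a disease. A minimal sketch of that kind of tally is shown here; the slides do not describe how Dizeez stores or aggregates guesses, so the log format, the function name `tally_guesses`, and the `min_occurrences` cutoff are all assumptions, and the player IDs and counts are made up.

```python
from collections import Counter

def tally_guesses(guesses, min_occurrences=2):
    """Count each (gene, disease) pair and keep pairs seen often enough.

    `guesses` is a hypothetical log of (player, gene, disease) tuples.
    """
    counts = Counter((gene, disease) for _player, gene, disease in guesses)
    # Most frequent pairs first; ties are broken alphabetically by pair.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(n, gene, disease) for (gene, disease), n in ranked
            if n >= min_occurrences]

# Made-up log; only the gene and disease names come from the slide.
toy_log = [
    ("player1", "GAST", "gastrinoma"),
    ("player2", "GAST", "gastrinoma"),
    ("player3", "GAST", "gastrinoma"),
    ("player1", "WRN", "Werner syndrome"),
    ("player2", "WRN", "Werner syndrome"),
    ("player3", "ABL1", "leukemia"),
]

for n, gene, disease in tally_guesses(toy_log):
    print(n, gene, disease)
```

A cutoff like `min_occurrences` is one simple way to favor pairs that several guesses agree on; counting distinct players per pair would be a stricter variant.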

Using games to predict phenotype from genotype?

http://genegames.org

The Cure

Classification problems in genome biology

Data: 100s of samples (cancer vs. normal) × 100,000s of features

Find patterns: SVM, neural networks, naïve Bayes, KNN, …

Classify new samples: cancer or normal

Random forests

Data: 100s of samples (cancer vs. normal) × 100,000s of features

Sample subset of cases and features

Train decision tree

Random forests

Data: 100s of samples (cancer vs. normal) × 100,000s of features

Classify new samples: cancer or normal

How can biological knowledge be incorporated?
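The random forest procedure outlined on the preceding slides (sample a subset of cases and features, train a decision tree, classify new samples by vote) can be sketched in a few lines. This is an illustrative toy, not the speaker's implementation: depth-one trees (stumps) stand in for full decision trees, the function names are hypothetical, and the expression values are made up.

```python
import random
from collections import Counter

LABELS = ("cancer", "normal")

def train_stump(samples, labels, feature_ids):
    """Return the (feature, threshold, high_label, low_label) split with fewest errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({s[f] for s in samples}):
            for high_label, low_label in (LABELS, LABELS[::-1]):
                errors = sum(
                    (high_label if s[f] > threshold else low_label) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_label, low_label)
    return best[1:]

def train_forest(samples, labels, n_trees=25, n_features=2, seed=0):
    """Train one stump per random subset of cases and features."""
    rng = random.Random(seed)
    n_samples, n_total = len(samples), len(samples[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample of cases (drawn with replacement).
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Random subset of features (drawn without replacement).
        feats = rng.sample(range(n_total), n_features)
        forest.append(train_stump([samples[i] for i in rows],
                                  [labels[i] for i in rows], feats))
    return forest

def classify(forest, sample):
    """Majority vote over all stumps in the forest."""
    votes = Counter(
        high if sample[f] > threshold else low
        for f, threshold, high, low in forest
    )
    return votes.most_common(1)[0][0]

# Made-up expression values: 6 samples x 4 features.
toy_samples = [
    [8.1, 2.0, 5.5, 1.2], [7.9, 3.1, 4.8, 0.9], [8.4, 2.6, 5.1, 1.1],
    [2.2, 2.4, 5.0, 1.0], [1.8, 3.0, 5.3, 1.3], [2.5, 2.2, 4.9, 0.8],
]
toy_labels = ["cancer"] * 3 + ["normal"] * 3

forest = train_forest(toy_samples, toy_labels)
print(classify(forest, [8.0, 2.5, 5.0, 1.0]))
```

In this sketch every feature is equally likely to be sampled for a tree, which is the step that a source of biological knowledge could replace.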

Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

Data: 100s of samples (cancer vs. normal) × 100,000s of features

Sample features by PPI network

Train decision tree
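One way to picture "sample features by PPI network" is to draw a connected group of genes from an interaction graph instead of drawing genes independently. The sketch below is only loosely inspired by that idea and should not be read as the algorithm in the cited paper; the function name, the growth rule, and the toy network with placeholder gene labels are all assumptions.

```python
import random

def sample_features_by_network(ppi, n_features, rng):
    """Grow a connected set of genes from a random seed gene.

    `ppi` maps each gene to a list of its interaction partners.
    """
    seed_gene = rng.choice(sorted(ppi))
    chosen = [seed_gene]
    while len(chosen) < n_features:
        # Neighbors of any chosen gene that have not been chosen yet.
        frontier = sorted({nb for g in chosen for nb in ppi[g]} - set(chosen))
        if not frontier:
            break  # the seed's connected component is exhausted
        chosen.append(rng.choice(frontier))
    return chosen

# Toy undirected network with placeholder gene labels (not real interactions).
toy_ppi = {
    "G1": ["G2", "G4"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "G4"],
    "G4": ["G1", "G3"],
    "G5": ["G6"],
    "G6": ["G5"],
}

print(sample_features_by_network(toy_ppi, 3, random.Random(0)))
```

The returned genes would then be the candidate features for one decision tree, so each tree is built from genes that are close to each other in the network.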

Human-guided forests

Data: 100s of samples (cancer vs. normal) × 100,000s of features

Sample features by human intelligence

Train decision tree
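A simple way to let human choices guide feature sampling is to weight each gene by how often players selected it. The slides do not say how player input is turned into features, so the sketch below is an assumption throughout: the pick counts, the placeholder gene labels, and the function name `sample_features_by_votes` are hypothetical.

```python
import random

def sample_features_by_votes(pick_counts, n_features, rng):
    """Draw distinct genes with probability proportional to player picks."""
    # Genes nobody picked are never sampled.
    remaining = {g: c for g, c in pick_counts.items() if c > 0}
    chosen = []
    while remaining and len(chosen) < n_features:
        genes = sorted(remaining)
        weights = [remaining[g] for g in genes]
        gene = rng.choices(genes, weights=weights, k=1)[0]
        chosen.append(gene)
        del remaining[gene]  # sample without replacement
    return chosen

# Made-up counts of how often players picked each placeholder gene.
toy_picks = {"G1": 12, "G2": 7, "G3": 1, "G4": 0, "G5": 3}

print(sample_features_by_votes(toy_picks, 3, random.Random(0)))
```

Genes that many players pick are likely to appear in many trees, while rarely picked genes still get an occasional chance.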

The Cure: Genomic predictors for disease

Human-guided forests

Classify new samples: cancer or normal

“Critical Assessment”-style challenge

Will this work? Check our blog after October 15.

Coming soon to genegames.org

Collaborators

Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members

Ben Good
Salvatore Loguercio
Ian Macleod
Max Nanis
Chunlei Wu

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)

Contact

http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/

@genegame
