Talk given at the Genome Informatics conference 2012 at Robinson College, Cambridge University.
The Gene Wiki: Crowdsourcing human gene annotation
Andrew Su, Ph.D.
@andrewsu
[email protected]
http://sulab.org
GeneGames.org
Genome Informatics
September 6, 2012
The Gene Wiki crib sheet
• Bulk creation of ~10k Wikipedia articles (http://dx.doi.org/10.1371/journal.pbio.0060175)
• Monthly stats: > 4 million views, > 1000 edits (http://dx.doi.org/10.1093/nar/gkr925)
• Text mining reveals novel Gene Ontology and Disease Ontology annotations (http://dx.doi.org/doi:10.1186/1471-2164-12-603)
• Mash-up with SNPedia for crowdsourced gene-disease database (http://www.jbiomedsem.com/content/3/S1/S6)
• Merging Wikipedia with the Semantic Web (http://dx.doi.org/10.1093/database/bar060)
http://www.slideshare.net/andrewsu
http://www.flickr.com/photos/archana3k1/4124330493/
Seven million human hours
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
http://fold.it
Using games to annotate genes?
http://genegames.org
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases
No good gene-disease annotation database
Query: Apolipoprotein E
Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …
477 diseases!
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 207 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences   Gene    Disease
7               GAST    gastrinoma
7               RBP3    retinoblastoma
7               SSX1    synovial sarcoma
6               TG      Graves' disease
6               CRYGC   Cataract
6               SOX8    mental retardation
6               WRN     Werner syndrome
6               ABL1    leukemia
6               MLL3    leukemia
6               SNAI2   breast carcinoma
Pubmed OMIM PharmGKB Gene Wiki
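The "# Occurrences" column above can be read as a tally over the guess log. The sketch below shows one way such a tally might be computed. The log records, the field layout, and the choice to count distinct players rather than raw guesses are all assumptions for illustration; the slides do not say how Dizeez counts occurrences.

```python
from collections import Counter

# Hypothetical guess log: (player, gene clue, disease clicked).
# These records are made up for illustration; they are not Dizeez data.
guesses = [
    ("p1", "GAST", "gastrinoma"),
    ("p2", "GAST", "gastrinoma"),
    ("p3", "GAST", "gastrinoma"),
    ("p1", "WRN", "Werner syndrome"),
    ("p2", "WRN", "Werner syndrome"),
    ("p3", "WRN", "leukemia"),
    ("p1", "GAST", "gastrinoma"),  # repeat by p1, counted once below
]


def tally_links(guess_log):
    """Count the distinct players who picked each (gene, disease) pair."""
    distinct = {(player, gene, disease) for player, gene, disease in guess_log}
    return Counter((gene, disease) for _, gene, disease in distinct)


def ranked(counts, min_occurrences=2):
    """Return (occurrences, gene, disease) rows, most frequent first."""
    rows = [(n, g, d) for (g, d), n in counts.items() if n >= min_occurrences]
    return sorted(rows, key=lambda r: (-r[0], r[1], r[2]))


link_counts = tally_links(guesses)
for n, gene, disease in ranked(link_counts):
    print(n, gene, disease)
```

Ranked rows like these could then be checked against existing databases to see which links are already known and which are candidates for new annotations.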
Dizeez players seem pretty smart…

# Occurrences   Gene    Disease
5               MECOM   sarcoma
4               ATF7    cancer
3               ABCB5   acute myeloid leukemia
3               SART1   glioblastoma
3               NCK1    leukemia
3               NEK1    cancer
Pubmed OMIM PharmGKB Gene Wiki
Using games to predict phenotype from genotype?
http://genegames.org
The Cure
Classification problems in genome biology
(Figure: data matrix of 100s of samples × 100,000s of features; samples labeled cancer or normal)
Find patterns
Classify new samples as cancer or normal
Methods: SVM, Neural networks, Naïve Bayes, KNN, …
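To make the classification setup concrete, here is a minimal sketch of one of the listed methods, k-nearest neighbours (KNN), in plain Python. The expression values, the three-feature matrix, and the function names are invented for illustration; real data would have hundreds of samples and on the order of 100,000 features.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by majority vote among its k nearest training samples."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda xy: euclidean(xy[0], sample)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy matrix: rows are samples, columns are features (made-up values).
train_x = [
    [5.1, 0.2, 3.3],
    [4.8, 0.4, 3.0],
    [5.3, 0.1, 3.6],
    [1.0, 2.9, 0.5],
    [1.2, 3.1, 0.7],
    [0.8, 2.7, 0.4],
]
train_y = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]

new_samples = [[5.0, 0.3, 3.1], [1.1, 3.0, 0.6]]
predictions = [knn_predict(train_x, train_y, s) for s in new_samples]
print(predictions)  # ['cancer', 'normal']
```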
Random forests
Sample a subset of cases and features
Train a decision tree
(Figure: data matrix of 100s of samples × 100,000s of features; samples labeled cancer or normal)
Random forests
Classify new samples as cancer or normal
(Figure: data matrix of 100s of samples × 100,000s of features; samples labeled cancer or normal)
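The random forest steps (sample cases and features, train a tree, repeat, then let the trees vote on new samples) can be sketched in plain Python. This is a simplified illustration, not a production implementation: each "tree" here is a one-level decision stump rather than a full decision tree, and the toy data are invented so that every feature separates the two classes.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Find the single-feature threshold split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train stumps, each on a bootstrap of cases and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # sample cases
        feats = rng.sample(range(len(rows[0])), n_features)  # sample features
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest


def predict(forest, sample):
    """Classify a new sample by majority vote over all stumps."""
    votes = Counter(
        left if sample[f] <= t else right for f, t, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Toy data (made up): 8 samples x 4 features.
rows = [
    [1.0, 1.5, 1.2, 1.8], [1.4, 1.1, 1.9, 1.3],
    [1.7, 1.6, 1.0, 1.1], [1.2, 1.9, 1.5, 1.6],
    [4.5, 5.0, 4.2, 5.5], [5.1, 4.3, 5.8, 4.9],
    [4.8, 5.6, 4.6, 4.1], [5.9, 4.7, 5.2, 5.0],
]
labels = ["normal"] * 4 + ["cancer"] * 4

forest = train_forest(rows, labels)
print(predict(forest, [5.0, 5.0, 5.0, 5.0]))
print(predict(forest, [0.9, 1.0, 0.8, 1.0]))
```

The two sampling calls in `train_forest` are the point where a different feature-selection strategy could be substituted.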
How to interject biological knowledge?
Network-guided forests
Dutkowski & Ideker (2011). PLoS Computational Biology
Network-guided forests
Sample features by PPI network
Train a decision tree
(Figure: data matrix of 100s of samples × 100,000s of features; samples labeled cancer or normal)
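One way to picture "sample features by PPI network" is to draw each tree's features as a connected set of genes in the interaction network instead of uniformly at random. The sketch below grows such a set from a random seed gene. It is only an illustration under stated assumptions: the network, the gene names, and the growth rule are hypothetical, and it is not claimed to be the sampling procedure used by Dutkowski & Ideker.

```python
import random

# Hypothetical protein-protein interaction network as symmetric adjacency sets.
# Gene names and edges are placeholders, not taken from any PPI database.
ppi = {
    "geneA": {"geneB", "geneC"},
    "geneB": {"geneA"},
    "geneC": {"geneA", "geneD"},
    "geneD": {"geneC"},
    "geneE": {"geneF"},
    "geneF": {"geneE"},
}


def sample_module(network, size, rng):
    """Grow a connected set of up to `size` genes from a random seed gene."""
    seed_gene = rng.choice(sorted(network))
    module = [seed_gene]
    frontier = set(network[seed_gene])
    while len(module) < size and frontier:
        gene = rng.choice(sorted(frontier))  # pick a neighbour of the module
        module.append(gene)
        frontier |= network[gene]
        frontier -= set(module)
    return module


rng = random.Random(42)
features_for_tree = sample_module(ppi, 3, rng)
print(features_for_tree)
```

Each sampled module would then serve as the feature subset for one decision tree, so every tree is built from genes that interact with one another.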
Human-guided forests
Sample features by human intelligence
Train a decision tree
(Figure: data matrix of 100s of samples × 100,000s of features; samples labeled cancer or normal)
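In a human-guided forest, the feature subset for each tree comes from people rather than from random draws. The slides do not describe how player choices are turned into features, so the sketch below makes an explicit assumption: genes chosen by more players are more likely to be sampled. The player hands, gene names, and weighting scheme are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical hands of genes chosen by three players (placeholder names).
player_picks = [
    ["geneA", "geneB", "geneC"],
    ["geneA", "geneC"],
    ["geneA", "geneD"],
]


def pick_weights(picks):
    """Count how many players chose each gene (one vote per player per gene)."""
    return Counter(gene for hand in picks for gene in set(hand))


def sample_features(weights, n, rng):
    """Draw up to n distinct genes, favouring genes picked by more players."""
    pool = dict(weights)
    chosen = []
    while pool and len(chosen) < n:
        genes = sorted(pool)
        gene = rng.choices(genes, weights=[pool[g] for g in genes], k=1)[0]
        chosen.append(gene)
        del pool[gene]  # sample without replacement
    return chosen


weights = pick_weights(player_picks)
rng = random.Random(7)
features_for_tree = sample_features(weights, 2, rng)
print(weights)
print(features_for_tree)
```

An alternative under the same assumptions would be to train one tree per player hand, which keeps each player's gene combination intact.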
The Cure: Genomic predictors for disease
Human-guided forests
Classify new samples as cancer or normal
“Critical Assessment”-style challenge
Will this work? Check our blog after October 15.
Coming soon to genegames.org
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Ben Good
Salvatore Loguercio
Ian Macleod
Max Nanis
Chunlei Wu
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/
@genegame