Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org Sanger/EBI September 7, 2012


Page 1: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Andrew Su, Ph.D.

@andrewsu

[email protected]

http://sulab.org

Sanger/EBI

September 7, 2012

Page 2: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Few genes are well annotated…

[Figure: counts per gene for Gene Ontology and PubMed, with genes sorted by decreasing counts; 23,278 protein-coding genes; marked values 38% and 59%; genes labeled: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE]

Data: NCBI gene2pubmed, August 2010

Page 3: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

… because the literature is sparsely curated?

[Figure: number of PubMed-indexed articles per year, 1979 to 2009; vertical axis from 0 to 1,000,000]

Page 4: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

… because the literature is sparsely curated?

[Figure labels: Average capacity of human scientist; Number of articles read by typical scientist]

Page 5: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 6: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 7: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Long Tail is a prolific source of content

[Figure: content produced against contributors (sorted), divided into a Short Head and a Long Tail]

Category           Short Head          Long Tail
News               Newspapers          Blogs
Video              TV/Hollywood        YouTube
Product reviews    Consumer reports    Amazon reviews
Food reviews       Food critics        Yelp
Talent judging     Olympics            American Idol

Page 8: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Wikipedia is reasonably accurate

Page 9: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Wikipedia has breadth and depth

[Table: Wikipedia compared with Britannica Online by articles, words (millions), and words per article; values not preserved]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

Page 10: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 11: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 12: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 13: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Wiki success depends on a positive feedback loop

Gene wiki page utility

Number of users

Number of contributors

1001

2002

Page 14: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 15: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki has a critical mass of readers

Total: 5.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Utility

Users

Contributors

Page 16: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

Utility

Users

Contributors

[Figure: editor count (Editors) and edit count (Edits)]

Page 17: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Page 18: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Making the Gene Wiki more reliable

The company name is derived from old Greek, and means "destroyer of birds".

Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

Page 19: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Making the Gene Wiki more reliable

http://www.wikitrust.net/

The company name is derived from old Greek, and means "destroyer of birds".

Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …

High-trust author: 36211 total edits

Low-trust author: 36 total edits

Page 20: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Making the Gene Wiki more computable

Structured annotations

Free text

Page 21: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel GO annotations

2147 novel DO annotations
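The mining step on this slide (wikilink, exact match to a GO term name, candidate assertion for the gene mapped to the page) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the Gene Wiki: the function name, the three-entry lookup table, and the sample wikitext are all invented for the example, and the name "endocytosis" for GO:0006897 comes from the Gene Ontology rather than from the slide.

```python
import re

# Hypothetical lookup from lower-cased GO term names to GO identifiers.
# A real run would load the whole ontology; these three IDs appear in the slides.
GO_NAME_TO_ID = {
    "endocytosis": "GO:0006897",
    "axon guidance": "GO:0007411",
    "muscle contraction": "GO:0006936",
}

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(entrez_gene_id, wikitext):
    """Return sorted (gene ID, GO ID) pairs for wikilinks that exactly match a GO term name."""
    found = set()
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().replace("_", " ").lower()
        go_id = GO_NAME_TO_ID.get(target)
        if go_id is not None:
            found.add((entrez_gene_id, go_id))
    return sorted(found)


# Invented wikitext standing in for the Gene Wiki page mapped to Entrez Gene 334.
sample_page = "The protein may take part in [[endocytosis]] and binds [[Copper|copper ions]]."
assertions = candidate_assertions(334, sample_page)
print(assertions)
```

Exact matching keeps the sketch simple; every pair it returns is only a candidate and would still need review before being counted as a novel annotation.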

Page 22: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

TOP 100 GENES

Page 23: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki content improves enrichment analysis

GO term

Gene list

Concept recognition

PubMed abstracts

Enrichment analysis

GO:0007411

axon guidance (GO:0007411): 264 genes

Linked genes through PubMed

P = 1.55 E-20

811 articles

Linked through PubMed    In GO term: Yes    In GO term: No
Yes                      13                 2
No                       251                12033
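One way to read the 2x2 table is as a one-sided hypergeometric (Fisher exact) test: of 12,299 genes in total, 264 carry the GO term, 15 are linked through PubMed, and 13 are in both sets. The sketch below assumes that reading, which the slide does not state, and the function and argument names are made up for the example. Under that assumption the result should be of the same order as the P value shown on the slide.

```python
from math import comb


def enrichment_p_value(both, list_only, term_only, neither):
    """One-sided hypergeometric P value: chance of an overlap of `both` or more genes."""
    total = both + list_only + term_only + neither
    in_list = both + list_only   # genes linked through PubMed
    in_term = both + term_only   # genes annotated with the GO term
    denominator = comb(total, in_list)
    numerator = 0
    for k in range(both, min(in_list, in_term) + 1):
        # Ways to draw k term genes and (in_list - k) non-term genes.
        numerator += comb(in_term, k) * comb(total - in_term, in_list - k)
    return numerator / denominator


# Counts taken from the table on this slide.
p_axon_guidance = enrichment_p_value(both=13, list_only=2, term_only=251, neither=12033)
print(f"P = {p_axon_guidance:.2e}")
```

Exact integer arithmetic with `math.comb` avoids underflow for P values this small; a statistics library would normally be used instead.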

Page 24: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki content improves enrichment analysis

GO term

Gene list

Concept recognition

PubMed abstracts + Gene Wiki

Enrichment analysis

GO:0006936

muscle contraction (GO:0006936): 87 genes

Linked genes through PubMed: P = 1.0

Linked genes through PubMed + Gene Wiki: P = 1.22 E-09

251 articles

87 articles

Page 25: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki content improves enrichment analysis

[Figure: p-value (PubMed only) plotted against p-value (PubMed + GW); regions labeled "More significant: PubMed + GW" and "More significant: PubMed only"; the muscle contraction term is marked]

Page 26: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene Wiki+: Crowdsourced semantic database

Q: What genes are related to hemolytic anemia?

Page 27: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Long Tail of scientists is a valuable source of information on gene function.

Page 28: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 29: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Gene databases are numerous and overlapping

… and hundreds more …

Page 30: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Community extensibility and user customizability

http://biogps.org

Page 31: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 32: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 33: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 34: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 35: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 36: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Total of 389 gene-centric online databases registered as BioGPS plugins

Page 37: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Users: BioGPS has critical mass

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

[Figure: daily pageviews]

Utility

Users

Contributors

Page 38: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Contributors: Explicit and implicit knowledge

389 plugins registered (65% publicly shared) by over 75 users spanning 150+ domains

Utility

Users

Contributors

Page 39: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Mining structured content from HTML

Page 40: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Defining a data extraction template

TP53 TNF APOE IL6 VEGF …EGFR TGFB1

Page 41: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The BioGPS Semantic Annotator

http://50.112.124.237

Page 42: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Long Tail of bioinformaticians can collaboratively build a gene portal.

Page 43: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Page 44: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Seven million human hours

http://www.flickr.com/photos/archana3k1/4124330493/

Page 45: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)


Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

Page 46: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Page 47: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Using games to fold proteins

Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Page 48: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Using games to fold RNAs

http://eterna.cmu.edu/

Page 49: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Using games to align sequences

http://phylo.cs.mcgill.ca

Page 50: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Using games to annotate genes?

http://genegames.org

Page 51: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

No good gene-disease annotation database

• Alzheimer's disease (AD)
• Lipoprotein glomerulopathy
• Sea-blue histiocyte disease

Query: Apolipoprotein E

Page 52: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

No good gene-disease annotation database

• Alzheimer's disease (AD)
• Lipoprotein glomerulopathy
• Sea-blue histiocyte disease
• Hyperlipoproteinemia, type III
• Macular degeneration, age-related
• Myocardial infarction susceptibility

Query: Apolipoprotein E

Page 53: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

No good gene-disease annotation database

• Alzheimer's disease (AD)
• Lipoprotein glomerulopathy
• Sea-blue histiocyte disease
• Hyperlipoproteinemia, type III
• Macular degeneration, age-related
• Myocardial infarction susceptibility
• HIV
• Psoriasis
• Vascular Diseases

Query: Apolipoprotein E

?

?

?

?

?

Page 54: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …

477 diseases!

Page 55: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Play Dizeez to annotate gene-disease links

1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!

Page 56: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Dizeez players seem pretty smart…

In total (since Dec 2011):
• 207 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences    Gene     Disease
7                GAST     gastrinoma
7                RBP3     retinoblastoma
7                SSX1     synovial sarcoma
6                TG       Graves' disease
6                CRYGC    Cataract
6                SOX8     mental retardation
6                WRN      Werner syndrome
6                ABL1     leukemia
6                MLL3     leukemia
6                SNAI2    breast carcinoma

Evidence columns: Pubmed, OMIM, PharmGKB, Gene Wiki (values not preserved)
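The occurrence counts in this table suggest that repeated guesses of the same gene-disease pair are tallied across games. A minimal sketch of such a tally is shown below. The game log format, the function name, and the `min_occurrences` cutoff are assumptions made for illustration; the slides do not describe how Dizeez stores or filters guesses.

```python
from collections import Counter


def tally_guesses(guesses, min_occurrences=2):
    """Count identical (gene, disease) guesses and keep pairs repeated often enough."""
    counts = Counter((gene.upper(), disease.lower()) for gene, disease in guesses)
    # Most frequent pairs first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(n, gene, disease) for (gene, disease), n in ranked if n >= min_occurrences]


# Invented game log: one (gene, disease) tuple per guess.
game_log = [
    ("GAST", "gastrinoma"), ("gast", "Gastrinoma"), ("GAST", "gastrinoma"),
    ("WRN", "Werner syndrome"), ("WRN", "werner syndrome"),
    ("ABL1", "leukemia"),
]
consensus = tally_guesses(game_log)
for n, gene, disease in consensus:
    print(n, gene, disease)
```

Pairs that several players pick independently can then be compared against existing sources such as those named in the evidence columns.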

Page 57: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Dizeez players seem pretty smart…

In total (since Dec 2011):
• 207 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences    Gene     Disease
5                MECOM    sarcoma
4                ATF7     cancer
3                ABCB5    acute myeloid leukemia
3                SART1    glioblastoma
3                NCK1     leukemia
3                NEK1     cancer

Evidence columns: Pubmed, OMIM, PharmGKB, Gene Wiki (values not preserved)

Page 58: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Using games to predict phenotype from genotype?

http://genegames.org

The Cure

Page 59: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Classification problems in genome biology

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]

find patterns

SVM, Neural networks, Naïve Bayes, KNN, …

Classify new samples: cancer or normal

Page 60: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Random forests

Sample subset of cases and features

Train decision tree

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]
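The two steps on this slide, sampling a subset of cases and features and then training a decision tree, can be sketched with one-level trees (decision stumps) standing in for full trees. This is a simplified toy under stated assumptions, not a complete random forest: the function names, the tiny data set, and the choice of stumps are all made up for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return None if best is None else best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train up to n_trees stumps, each on a bootstrap of cases and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        cases = [rng.randrange(len(rows)) for _ in rows]         # sample cases with replacement
        features = rng.sample(range(len(rows[0])), n_features)   # sample a subset of features
        sample_labels = [labels[i] for i in cases]
        if len(set(sample_labels)) < 2:
            continue  # skip bootstraps that contain a single class
        stump = train_stump([rows[i] for i in cases], sample_labels, features)
        if stump is not None:
            forest.append(stump)
    return forest


def classify(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Invented expression values: three features, six samples.
training_rows = [[5.1, 7.0, 6.2], [4.8, 6.5, 5.9], [5.5, 7.2, 6.8],
                 [1.0, 2.1, 1.5], [0.7, 1.8, 2.0], [1.2, 2.5, 1.1]]
training_labels = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]
forest = train_forest(training_rows, training_labels)
print(classify(forest, [5.0, 6.8, 6.0]), classify(forest, [0.5, 1.5, 1.0]))
```

The feature-sampling line is the part that the later slides vary: it is where network or human knowledge could replace uniform random sampling.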

Page 61: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Random forests

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]

Page 62: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Random forests

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]

Classify new samples: cancer or normal

How to interject biological knowledge?

Page 63: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

Page 64: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Network-guided forests

Sample features by PPI network

Train decision tree

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]
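One simple way to picture "sample features by PPI network" is to pick a seed gene at random and grow a connected set of genes around it, so that each tree sees features that interact with one another. The breadth-first sketch below is a stand-in chosen for brevity and is not claimed to be the procedure of Dutkowski & Ideker; the function name and the toy network, whose edges are invented and not real interactions, are assumptions.

```python
import random
from collections import deque


def sample_connected_features(ppi, k, rng):
    """Sample up to k genes that form a connected piece of the interaction network."""
    seed_gene = rng.choice(sorted(ppi))
    chosen = [seed_gene]
    seen = {seed_gene}
    frontier = deque([seed_gene])
    while frontier and len(chosen) < k:
        gene = frontier.popleft()
        neighbours = sorted(ppi.get(gene, ()))
        rng.shuffle(neighbours)  # visit interaction partners in random order
        for nb in neighbours:
            if nb not in seen and len(chosen) < k:
                seen.add(nb)
                chosen.append(nb)
                frontier.append(nb)
    return chosen


# Invented adjacency lists; gene symbols are reused from the slides for flavor only.
toy_ppi = {
    "TP53": {"EGFR", "VEGFA"},
    "EGFR": {"TP53", "TGFB1"},
    "VEGFA": {"TP53"},
    "TGFB1": {"EGFR"},
    "APOE": {"ACE"},
    "ACE": {"APOE"},
}
network_features = sample_connected_features(toy_ppi, k=3, rng=random.Random(1))
print(network_features)
```

If the seed falls in a small component, fewer than `k` genes are returned, which a caller would need to handle.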

Page 65: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Human-guided forests

Sample features by human intelligence

Train decision tree

[Figure: data matrix of 100s of samples by 100,000s of features, with samples labeled cancer or normal]
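"Sample features by human intelligence" could be realized in many ways. One guess, sketched below, is to count how often players choose each gene and draw features with probability proportional to those counts. The vote counts, the function name, and the `pseudocount` parameter are all hypothetical; the slides do not say how player choices are turned into feature sets.

```python
import random


def sample_features_by_votes(votes, k, rng, pseudocount=1.0):
    """Draw up to k distinct genes, each with probability proportional to votes + pseudocount."""
    pool = dict(votes)
    chosen = []
    while pool and len(chosen) < k:
        genes = sorted(pool)
        weights = [pool[g] + pseudocount for g in genes]
        pick = rng.choices(genes, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


# Invented vote counts: how many players picked each gene.
player_votes = {"TP53": 12, "EGFR": 7, "VEGFA": 3, "APOE": 0, "ACE": 1}
human_features = sample_features_by_votes(player_votes, k=3, rng=random.Random(0))
print(human_features)
```

The pseudocount keeps genes with zero votes selectable, so the sampler still explores features that no player has chosen yet.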

Page 66: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)


Page 67: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 68: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 69: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 70: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 71: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 72: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Cure: Genomic predictors for disease

Page 73: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Human-guided forests

Classify new samples

cancer

normal

Page 74: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

“Critical Assessment”-style challenge

Will this work? Check our blog after October 15.

Coming soon to genegames.org

Page 75: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

The Long Tail of gamers can collaboratively build an accurate disease classifier.

Page 76: Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)

Group members
• Ben Good
• Salvatore Loguercio
• Ian Macleod
• Max Nanis
• Chunlei Wu

Collaborators
• Doug Howe, ZFIN
• John Hogenesch, U Penn
• Jon Huss, GNF
• Luca de Alfaro, UCSC
• Angel Pizzaro, U Penn
• Faramarz Valafar, SDSU
• Pierre Lindenbaum, Fondation Jean Dausset
• Michael Martone, Rush
• Konrad Koehler, Karo Bio
• Warren Kibbe, Simon Lim, Northwestern
• Many Wikipedia editors
• WP:MCB Project

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

Contact
• http://sulab.org
• [email protected]
• @andrewsu
• +Andrew Su
• @genegame

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/