Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Given at DBMI seminar series at UCSD. http://dbmi.ucsd.edu/display/DBMI/Seminars


Andrew Su, Ph.D. (@andrewsu)

asu@scripps.edu, http://sulab.org

April 5, 2013

UCSD DBMI Seminar

Few genes are well annotated…

[Figure: GO annotation counts per gene, with genes sorted by decreasing counts, across 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%. Data: NCBI, February 2013.]

… because the literature is sparsely curated?

[Figure: Number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000.]

[Figure: Average capacity of a human scientist (number of articles read by a typical scientist).]

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

The Long Tail is a prolific source of content

[Figure: Content produced vs. contributors (sorted), divided into a Short Head and a Long Tail.]

Category: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol

Wikipedia is reasonably accurate

Wikipedia has breadth and depth

[Figure: Articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Filtering, extracting, and summarizing PubMed

Documents, Concepts, Review article

Wiki success depends on positive feedback

[Figure: feedback cycle linking Gene Wiki page utility, number of users, and number of contributors.]

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words, approximately equal to 230 full-length articles

Good, NAR, 2011

[Figure: editors (editor count) and edits (edit count).]

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Making the Gene Wiki more computable

Structured annotations

Free text

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel GO annotations

2147 novel DO annotations
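The labels on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) suggest a simple lookup pipeline. The following is a minimal Python sketch under stated assumptions: the function name `candidate_assertions`, the in-memory lookup tables, the article title, and the case-insensitive matching rule are all illustrative and are not taken from the talk.

```python
def candidate_assertions(page_title, wikilinks, page_to_gene, go_name_to_id):
    """Propose (gene_id, go_id) pairs for wikilinks whose title exactly matches a GO term name.

    All inputs are hypothetical in-memory tables used only for illustration.
    """
    # Gene Wiki mapping: article title -> NCBI Entrez Gene ID
    gene_id = page_to_gene.get(page_title)
    if gene_id is None:
        return []
    assertions = []
    for link in wikilinks:
        # GO exact match: compare the link title to GO term names (case-insensitive here)
        go_id = go_name_to_id.get(link.strip().lower())
        if go_id is not None:
            assertions.append((gene_id, go_id))
    return assertions


# Toy data: the article title and link list are made up.
# 334 and GO:0006897 are the identifiers shown on the slide; the term name is assumed.
page_to_gene = {"Example gene article": 334}
go_name_to_id = {"endocytosis": "GO:0006897"}
links = ["Endocytosis", "Cell membrane"]

result = candidate_assertions("Example gene article", links, page_to_gene, go_name_to_id)
print(result)
```

Each returned pair is only a candidate; the slide does not say how candidates were filtered or reviewed before being counted as novel annotations.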

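Gene Wiki content improves enrichment analysis

Workflow elements: GO term, gene list, PubMed abstracts, concept recognition, enrichment analysis

Example: axon guidance (GO:0007411), 264 genes, 811 articles

Linked genes through PubMed: P = 1.55 E-20

Contingency table (columns: gene annotated with the GO term; rows: gene linked through PubMed):

                      GO term: Yes   GO term: No
Linked: Yes                     13             2
Linked: No                     251         12033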
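A 2×2 table like this one is commonly scored with a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. The sketch below assumes that is the test behind the slide's P value (the talk does not say so explicitly), and the function and argument names are illustrative.

```python
from math import comb


def hypergeom_upper_tail(k, n_linked, n_term, n_total):
    """P(X >= k): chance that at least k of n_linked genes carry the GO term.

    n_term genes out of n_total carry the term; n_linked genes are drawn without replacement.
    """
    upper = min(n_linked, n_term)
    numerator = sum(
        comb(n_term, i) * comb(n_total - n_term, n_linked - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_linked)


# Counts from the slide's table: 13 and 2 in the first row, 251 and 12033 in the second.
a, b, c, d = 13, 2, 251, 12033
p_value = hypergeom_upper_tail(
    k=a,
    n_linked=a + b,       # genes linked through PubMed
    n_term=a + c,         # genes annotated with the GO term (264)
    n_total=a + b + c + d,
)
print(f"{p_value:.3g}")
```

Under these assumptions the result should be of the same order of magnitude as the P value shown on the slide.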

Gene Wiki content improves enrichment analysis

Workflow elements: GO term, gene list, PubMed abstracts + Gene Wiki, concept recognition, enrichment analysis

Example: muscle contraction (GO:0006936), 87 genes

Linked genes through PubMed: P = 1.0

Linked genes through PubMed + Gene Wiki: P = 1.22 E-09

251 articles

87 articles

Gene Wiki content improves enrichment analysis

[Figure: p-value (PubMed only) vs. p-value (PubMed + GW), with regions labeled "More significant PubMed + GW" and "More significant PubMed only"; muscle contraction is labeled.]

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Databases

Linked Data

The Long Tail of scientists is a valuable source of information on gene function.

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Gene databases are numerous and overlapping

… and hundreds more …

Why is there so much redundancy?

[Figure labels: Users, Requests, Resources, Time, Community development]

BioGPS emphasizes community extensibility

Why do developers define the gene report view?

BioGPS emphasizes user customizability

http://biogps.org

Community extensibility and user customizability

Utility: A simple and universal plugin interface

URL template + gene entity = rendered URL

KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}

PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
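The plugin idea above (a URL template plus a gene entity gives a rendered URL) can be sketched in a few lines. This is only an illustration of the `{{Key}}` substitution shown on the slide; the function name `render_plugin_url` and the shape of the gene record are assumptions, not the BioGPS implementation.

```python
import re


def render_plugin_url(template, gene):
    """Replace each {{Key}} placeholder in the template with the gene's identifier for that key."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene record has no identifier named {key!r}")
        return str(gene[key])

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Hypothetical gene record; only the keys needed by the template have to be present.
gene = {"Symbol": "TP53", "EntrezGene": 7157}
kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

url = render_plugin_url(kegg_template, gene)
print(url)
```

Because a plugin is just a template string, registering a new database needs no code on the portal side, which fits the "simple and universal" framing of the slide.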


Total of > 540 gene-centric online databases registered as BioGPS plugins

Users: BioGPS has critical mass

• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

[Figure: Daily pageviews]

Contributors: Explicit and implicit knowledge

540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains

All resources should provide RDF…

Mining structured content from HTML

Defining a data extraction template

TP53 TNF APOE IL6 VEGF … EGFR TGFB1

The BioGPS Semantic Annotator

http://54.244.135.254:8080

The Long Tail of bioinformaticians can collaboratively build a gene portal.

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Seven million human hours

(Image: http://www.flickr.com/photos/archana3k1/4124330493/)

Twenty million human hours

(Image: http://www.flickr.com/photos/ableman/2171326385/)

150 billion human hours per year

(Image: http://www.flickr.com/photos/rvp-cw/6243289302/)

Using games to fold proteins

Fold.it players have successfully:

• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Using games to fold RNAs

http://eterna.cmu.edu/

Using games to align sequences

http://phylo.cs.mcgill.ca

Using games to diagnose malaria infection

http://biogames.ee.ucla.edu/

Using games to map neurons

http://eyewire.org

Using games to annotate genes?

http://genegames.org

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases

[Slide marks five of these entries with “?”]

No good gene-disease annotation database

Query: Apolipoprotein E

Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …

477 diseases!

Play Dizeez to annotate gene-disease links

1. Read the clue (gene)

2. Click the related disease (only one is “right”)

3. If it’s ‘right’, you get points

4. Then on to the next question…

5. Hurry!

6. Play to win!

Dizeez players seem pretty smart…

In total (since Dec 2011):

• 230 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences Gene Disease

11 NBPF3 neuroblastoma

11 SOX8 mental retardation

9 ABL1 leukemia

9 SSX1 synovial sarcoma

8 APC colorectal cancer

8 FES sarcoma

8 RBP3 retinoblastoma

8 GAST gastrinoma

8 DCC colorectal cancer

8 MAP3K5 cancer

Gene Wiki OMIM PharmGKB PubMed
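One plausible reading of the "# Occurrences" column is the number of times players selected a given gene-disease pair. Under that assumption, aggregating the raw guesses is a simple counting step, sketched here; the function name `top_pairs`, the `min_count` cutoff, and the toy guesses are illustrative and are not the actual Dizeez data or pipeline.

```python
from collections import Counter


def top_pairs(guesses, min_count=2):
    """Count how often each (gene, disease) pair was chosen and keep the repeated ones."""
    counts = Counter((gene, disease) for gene, disease in guesses)
    # most_common() orders pairs from most to least frequently chosen
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Toy guess log (made up): one (gene, disease) tuple per player click.
guesses = [
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("APC", "sarcoma"),
]

ranked = top_pairs(guesses)
for (gene, disease), n in ranked:
    print(n, gene, disease)
```

Requiring a pair to recur across independent players is one simple way to separate consistent answers from one-off clicks.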

Using games to predict phenotype from genotype?

http://genegames.org

Classification problems in genome biology

[Figure: data matrix of 100s of samples (cancer vs. normal) by 100,000s of features.]

Find patterns, then classify new samples as cancer or normal.

Methods: SVM, neural networks, Naïve Bayes, KNN, …

Random forests

Sample subset of cases and features, then train a decision tree.

Classify new samples as cancer or normal.

[Figure: data matrix of 100s of samples (cancer vs. normal) by 100,000s of features.]
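The random forest recipe on these slides (sample a subset of cases and features, train a tree, then classify new samples) can be sketched in plain Python. To stay short, this sketch trains one-split decision stumps instead of full decision trees and combines them by majority vote; the function names, parameters, and toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Pick the (feature, threshold) split among `features` with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of cases and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample cases with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # sample a subset of features
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Classify a new sample by majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy expression-like data (made up): three features, six samples.
rows = [[1.0, 10.0, 0.10], [1.2, 11.0, 0.20], [0.9, 10.5, 0.15],
        [3.0, 20.0, 0.80], [3.2, 21.0, 0.90], [2.9, 19.5, 0.85]]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

forest = train_forest(rows, labels)
print(predict(forest, [3.1, 20.5, 0.88]), predict(forest, [0.8, 9.0, 0.05]))
```

The feature-sampling line is the natural place to inject prior knowledge, since replacing uniform sampling with a biased draw changes which features each tree is allowed to consider.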

How to interject biological knowledge?

Network-guided forests

Dutkowski & Ideker (2011). PLoS Computational Biology

Sample features by PPI network, then train a decision tree.

[Figure: data matrix of 100s of samples (cancer vs. normal) by 100,000s of features.]

Human-guided forests

Sample features by human intelligence, then train a decision tree.

[Figure: data matrix of 100s of samples (cancer vs. normal) by 100,000s of features.]
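The talk does not spell out how human choices feed into feature sampling. One simple possibility, shown here purely as an assumption, is to weight each feature by how often players selected it and draw features without replacement according to those weights. The function name `sample_features_by_votes`, the `+1` smoothing, and the vote counts are hypothetical.

```python
import random


def sample_features_by_votes(votes, k, rng):
    """Draw k distinct features, each pick weighted by how often humans selected that feature."""
    remaining = dict(votes)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        # +1 smoothing (an assumption) so features nobody picked can still be drawn
        weights = [remaining[name] + 1 for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement: a feature appears at most once per tree
    return chosen


# Hypothetical vote counts: how many times players picked each feature (gene).
votes = {"gene_a": 12, "gene_b": 0, "gene_c": 5, "gene_d": 1}

picked = sample_features_by_votes(votes, 3, random.Random(0))
print(picked)
```

Each tree in the forest would then be trained on a feature subset drawn this way instead of a uniform random subset.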

The Cure: Genomic predictors for disease

Human-guided forests

Classify new samples as cancer or normal.

“Critical Assessment”-style challenge

Results

• 214 registered players
  – 50% declared knowledge of cancer biology
  – 40% self-identified as having Ph.D.
• Prediction results
  – 70% correct on survival concordance index
  – Best scoring model was 76%
  – Player registrations still increasing!
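The results above are reported as a survival concordance index. The exact scoring used in the challenge is not given in the talk; the sketch below shows a common pairwise definition (Harrell's C), with illustrative names and toy data.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs whose predicted risk order matches survival order."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before patient j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher predicted risk for the earlier event
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): survival times, event flags (1 = event observed, 0 = censored), risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for scores of 70% and 76%.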

The Long Tail of gamers can collaboratively build an accurate disease classifier.

Collaborators

Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members

Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu

Key group alumni

Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820)

Contact

http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Recruiting graduate students in quantitative biology!

Doctoral Program in Chemical and Biological Sciences
CALIFORNIA
Office of Graduate Studies
10550 N. Torrey Pines Road
La Jolla, CA 92037
Email: gradprgrm@scripps.edu
Phone: 858.784.8469
http://education.scripps.edu
