Open biomedical knowledge using crowdsourcing and citizen science Andrew Su, Ph.D. @andrewsu [email protected] http://sulab.org November 5, 2015 UCSD Slides: slideshare.net/andrewsu


Page 1: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

using crowdsourcing and

citizen science

Andrew Su, Ph.D.

@andrewsu

[email protected]

http://sulab.org

November 5, 2015

UCSD

Slides: slideshare.net/andrewsu

Page 2: Open biomedical knowledge using crowdsourcing and citizen science

Candidate genes

FLNB

CTNNB1

EPHA3

SMAD3

XPO1

RPS27

FLCN

ATR

FLT3

BRD2

ERG

RAF1

EGFR

ERBB4

RARA

JAK3

LRP1

WT1

PML

SMARCA4

Candidate variants

chr1:g.156084782C>G

chr6:g.31911991G>T

chr19:g.3767338C>T

chr19:g.3783925C>T

chr7:g.552021G>A

chr3:g.123005609G>T

Page 3: Open biomedical knowledge using crowdsourcing and citizen science

Biology is an INFORMATION science

Pietro Bellini https://flic.kr/p/k5jmja

Page 4: Open biomedical knowledge using crowdsourcing and citizen science

Prioritization of human genetic variants

1000s of genetic variants

< 10 candidate genes

Filters

- Variant type
- Allele frequencies
- Previous clinical observation
- Predicted functional effects
- Gene function
- …
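The filter list above can be read as a funnel: each filter removes variants, and the count that survives each step is recorded. A minimal sketch follows. The `Variant` fields, the thresholds, and the example records are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant record (all fields are assumptions)."""
    hgvs_id: str
    variant_type: str                   # e.g. "missense", "synonymous"
    allele_frequency: Optional[float]   # None = not observed in the reference population
    predicted_damaging: bool


# A named predicate: returns True when the variant should be kept.
Filter = Tuple[str, Callable[[Variant], bool]]


def apply_filters(
    variants: Iterable[Variant], filters: List[Filter]
) -> Tuple[List[Variant], List[Tuple[str, int]]]:
    """Apply filters in order and record how many variants survive each step."""
    remaining = list(variants)
    funnel = [("input", len(remaining))]
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        funnel.append((name, len(remaining)))
    return remaining, funnel


# Illustrative filters; the 1% frequency cutoff is an assumption.
FILTERS: List[Filter] = [
    ("variant type", lambda v: v.variant_type != "synonymous"),
    ("allele frequency", lambda v: v.allele_frequency is None or v.allele_frequency < 0.01),
    ("predicted functional effect", lambda v: v.predicted_damaging),
]

# Made-up records, not real variants.
EXAMPLE = [
    Variant("chrN:g.100A>G", "synonymous", 0.20, False),
    Variant("chrN:g.200C>T", "missense", 0.15, True),
    Variant("chrN:g.300G>A", "missense", 0.001, False),
    Variant("chrN:g.400T>C", "missense", None, True),
]

survivors, funnel = apply_filters(EXAMPLE, FILTERS)
for step, count in funnel:
    print(f"{step}: {count}")
```

Keeping the per-step counts makes it easy to see which filter does most of the narrowing.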

Page 5: Open biomedical knowledge using crowdsourcing and citizen science

Data integration as a cottage industry

dbNSFP

Page 6: Open biomedical knowledge using crowdsourcing and citizen science

Data integration as hardened community software

dbNSFP

MyVariant.info

Page 7: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

Gene

MyGene.info

Page 8: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

http://mygene.info/metadata

Current version history

Current stats

Page 9: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

[Figure: histogram of request times for the gene annotation service (/v2/gene). x-axis: request time (ms), in 10 ms bins from 10 to 100 plus "More". y-axis: frequency, 0 to 450,000. The three tallest bars are labeled 399,070, 210,381 and 120,173.]

Page 10: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

2 to 3 million requests per month

Page 11: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

Page 12: Open biomedical knowledge using crowdsourcing and citizen science

MyGene.info for integrating gene annotations

2015 – 2018

Page 13: Open biomedical knowledge using crowdsourcing and citizen science

Bioinformatician-friendly JSON output, REST API

http://MyGene.info/v2/gene/7157

http://MyVariant.info/v1/variant/chr7:g.55241707G>T
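The two URLs above follow a simple pattern: a base, a version prefix, an entity type, and an identifier. The sketch below builds such URLs and fetches JSON using only the Python standard library. The `/v2` and `/v1` prefixes are copied from the slide and may have changed since the talk. The helper names are hypothetical, and `fetch_json` is defined but not called here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

MYGENE_BASE = "http://mygene.info/v2"        # version prefix as shown on the slide
MYVARIANT_BASE = "http://myvariant.info/v1"  # version prefix as shown on the slide


def gene_url(entrez_gene_id: int) -> str:
    """URL for a single gene annotation record, keyed by Entrez Gene ID."""
    return f"{MYGENE_BASE}/gene/{entrez_gene_id}"


def variant_url(hgvs_id: str) -> str:
    """URL for a single variant record, keyed by an HGVS-style ID.

    The ">" in the ID is percent-encoded; ":" and "." are left readable.
    """
    return f"{MYVARIANT_BASE}/variant/{quote(hgvs_id, safe=':.')}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(gene_url(7157))
print(variant_url("chr7:g.55241707G>T"))
```

With network access, `fetch_json(gene_url(7157))` should return the annotation document as a Python dictionary. The fields it contains depend on the service version.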

Page 14: Open biomedical knowledge using crowdsourcing and citizen science

Variant and gene prioritization

Page 15: Open biomedical knowledge using crowdsourcing and citizen science

Variant and gene prioritization

Counts shown on the slide, in order: 2441, 2308, 1917, 18, 9, 5

Page 16: Open biomedical knowledge using crowdsourcing and citizen science

Variant and gene prioritization

Counts shown on the slide, in order: 2441, 2308, 1917, 18, 9, 5

https://github.com/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb

Page 17: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Page 18: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Page 19: Open biomedical knowledge using crowdsourcing and citizen science

The Gene Wiki project

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Page 20: Open biomedical knowledge using crowdsourcing and citizen science

The Gene Wiki project

Page 21: Open biomedical knowledge using crowdsourcing and citizen science

The Gene Wiki project

Page 22: Open biomedical knowledge using crowdsourcing and citizen science

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Page 23: Open biomedical knowledge using crowdsourcing and citizen science

Centralizing key data storage

Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf

Page 24: Open biomedical knowledge using crowdsourcing and citizen science

Centralizing key data storage

Page 25: Open biomedical knowledge using crowdsourcing and citizen science

Centralizing key data storage

Page 26: Open biomedical knowledge using crowdsourcing and citizen science

Loading biological data into Wikidata

Entrez Gene

Ensembl

UniProt

UCSC

PDB

RefSeq

Page 27: Open biomedical knowledge using crowdsourcing and citizen science

Wikidata for biology

Reelin (Q414043)

is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)

regulates (Property:P128): Neural development (Q1345738)

interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

http://www.wikidata.org/wiki/Q414043

Page 28: Open biomedical knowledge using crowdsourcing and citizen science

Wikidata for biology

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
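The API call above returns an entity document whose statements are grouped by property ID. The sketch below builds the same request URL and extracts item-valued statements from a response. The `format=json` parameter is an addition not shown on the slide. `SAMPLE` is a hand-written fragment in the shape of a Wikidata entity document, not a real response, and the function names are hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlencode

API = "http://wikidata.org/w/api.php"  # endpoint as written on the slide


def wbgetentities_url(qid: str, language: str = "en") -> str:
    """Build a wbgetentities request URL for one item."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",  # assumption: added to request machine-readable JSON
    }
    return f"{API}?{urlencode(params)}"


def item_values(entity: Dict, prop: str) -> List[str]:
    """Return the Q-ids of all item-valued statements for one property."""
    qids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "somevalue" / "novalue" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and value.get("entity-type") == "item":
            qids.append(f"Q{value['numeric-id']}")
    return qids


def _item_statement(prop: str, numeric_id: int) -> Dict:
    """Helper that builds one hand-written statement for the sample below."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "numeric-id": numeric_id},
            },
        }
    }


# Hand-written illustration using IDs that appear on the slide.
SAMPLE = {
    "id": "Q414043",
    "claims": {"P31": [_item_statement("P31", 8054), _item_statement("P31", 187126)]},
}

print(wbgetentities_url("Q414043"))
print(item_values(SAMPLE, "P31"))
```

In a live response the entity sits under `response["entities"]["Q414043"]`, so that sub-dictionary is what would be passed to `item_values`.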

Page 29: Open biomedical knowledge using crowdsourcing and citizen science

~150k genes and proteins

~2k FDA-approved drugs

~7k human diseases

Page 30: Open biomedical knowledge using crowdsourcing and citizen science

Centralizing key data storage

287 language editions of Wikipedia

Bioinformatics community

Toxicology community

Epidemiology community

… …

Page 31: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Page 32: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

Free text to structured data

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Page 33: Open biomedical knowledge using crowdsourcing and citizen science

The biomedical literature is massive…

[Figure: number of new PubMed-indexed articles per year, 1983 to 2013. y-axis: 0 to 1,200,000.]

Page 34: Open biomedical knowledge using crowdsourcing and citizen science

… but it is very hard to query and compute

Page 35: Open biomedical knowledge using crowdsourcing and citizen science

… but it is very hard to query and compute

Imatinib

Crizotinib

Erlotinib

Gefitinib

Sorafenib

Lapatinib

Dasatinib

AND

Acute myeloid leukemia

Acute lymphoblastic leukemia

Chronic myelogenous leukemia

Chronic lymphocytic leukemia

Hodgkin lymphoma

Non-Hodgkin lymphoma

Myeloma
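One way to read this slide is as a single boolean literature search: any of the drugs AND any of the diseases. The sketch below assembles such a query string with uppercase boolean operators and quoted phrases, as PubMed-style search boxes accept. The function names are hypothetical. The result of such a search is still a list of articles, not a table of drug-disease relationships, which is the difficulty the slide title points at.

```python
from typing import Sequence

DRUGS = [
    "Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
    "Sorafenib", "Lapatinib", "Dasatinib",
]

DISEASES = [
    "Acute myeloid leukemia", "Acute lymphoblastic leukemia",
    "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
    "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma",
]


def or_group(terms: Sequence[str]) -> str:
    """Join terms into a parenthesized OR group of quoted phrases."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def drug_disease_query(drugs: Sequence[str], diseases: Sequence[str]) -> str:
    """Boolean query matching any listed drug together with any listed disease."""
    return f"{or_group(drugs)} AND {or_group(diseases)}"


query = drug_disease_query(DRUGS, DISEASES)
pair_count = len(DRUGS) * len(DISEASES)  # distinct drug-disease pairs the query covers
print(query)
print(pair_count)
```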

Page 36: Open biomedical knowledge using crowdsourcing and citizen science

The Network of BioThings

1. Identify biomedical concepts in text

… We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment.

Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.

GENES

DISEASES

DRUGS

VARIANTS
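Step 1 can be illustrated with a very small dictionary-based tagger. This is a swapped-in toy technique, not the method used in the talk, which asks people to do the highlighting. The lexicon contains only the six concepts named on these slides, and the `Mention` type and `tag_concepts` function are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    text: str       # the matched text
    category: str   # GENE, DISEASE, DRUG or VARIANT


# Toy lexicon limited to the concepts named on the slides.
LEXICON: Dict[str, str] = {
    "familial systemic mastocytosis": "DISEASE",
    "imatinib": "DRUG",
    "dasatinib": "DRUG",
    "PKC412": "DRUG",
    "KIT": "GENE",
    "K509I": "VARIANT",
}

ABSTRACT = (
    "We report a case of familial systemic mastocytosis with the rare KIT K509I "
    "germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 "
    "reduced cell viability of primary mast cells harboring KIT K509I mutation. "
    "Both patients with familial systemic mastocytosis had remarkable "
    "hematological and skin improvement after three months of imatinib treatment."
)


def tag_concepts(text: str, lexicon: Dict[str, str] = LEXICON) -> List[Mention]:
    """Find whole-word, case-insensitive matches of lexicon terms in the text."""
    mentions = []
    for term, category in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            mentions.append(Mention(match.start(), match.end(), match.group(0), category))
    return sorted(mentions)  # ordered by position in the text


mentions = tag_concepts(ABSTRACT)
print(Counter(m.category for m in mentions))
```

Exact string matching like this misses synonyms and abbreviations. Case-insensitive matching would also tag the ordinary word "kit" as a gene.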

Page 37: Open biomedical knowledge using crowdsourcing and citizen science

The Network of BioThings

1. Identify biomedical concepts in text

2. Identify relationships between concepts

Concepts: imatinib, dasatinib, PKC412, Familial systemic mastocytosis, KIT, K509I

Relationship labels: Mutation of, Mutation causes, causes, treats, inhibits

Page 38: Open biomedical knowledge using crowdsourcing and citizen science

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Page 39: Open biomedical knowledge using crowdsourcing and citizen science

Question: Can Citizen Scientists collectively perform concept recognition in biomedical texts?

Page 40: Open biomedical knowledge using crowdsourcing and citizen science

Simple annotation interface

Click to see instructions

Highlight disease mentions

15 workers annotate each abstract

Page 41: Open biomedical knowledge using crowdsourcing and citizen science

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.78

$$$

Page 42: Open biomedical knowledge using crowdsourcing and citizen science

Experts versus crowd for concept identification

593 PubMed abstracts

6,900 mentions of “disease concepts”

F = 0.87

F = 0.87

$$$

• 9 days

• 145 workers

• Total: $630.96
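The F values on these slides compare crowd annotations with an expert reference. A minimal sketch of one way to get from many workers' highlights to a single F score is given below. The slides do not state the aggregation method, so the vote threshold is an assumption, as is the exact-match rule for spans. All data here are made up.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a highlighted mention


def aggregate_by_votes(worker_annotations: Iterable[Set[Span]], min_votes: int) -> Set[Span]:
    """Keep every span highlighted by at least `min_votes` workers."""
    votes = Counter(span for spans in worker_annotations for span in spans)
    return {span for span, n in votes.items() if n >= min_votes}


def f_score(predicted: Set[Span], gold: Set[Span]) -> float:
    """Harmonic mean of precision and recall, using exact span matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: three workers, one reference annotation.
gold = {(0, 5), (10, 20), (30, 40)}
workers = [
    {(0, 5), (10, 20)},
    {(0, 5), (10, 20), (50, 60)},
    {(0, 5), (30, 40), (50, 60)},
]

consensus = aggregate_by_votes(workers, min_votes=2)
print(sorted(consensus))
print(round(f_score(consensus, gold), 3))
```

Raising `min_votes` generally trades recall for precision, so the threshold would normally be tuned against a reference set.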

Page 43: Open biomedical knowledge using crowdsourcing and citizen science

Does Mechanical Turk scale?

1,000,000 articles per year

× 10 annotators / article

× 4 tasks / doc

× $0.066 / task

= $2,640,000 / year
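The yearly figure is the product of the four quantities on this slide. A short check follows, using `Decimal` so the per-task price is not subject to floating-point rounding. The function and constant names are hypothetical.

```python
from decimal import Decimal

ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
COST_PER_TASK = Decimal("0.066")  # US dollars, as given on the slide


def annual_cost(articles: int, annotators: int, tasks: int, cost_per_task: Decimal) -> Decimal:
    """Total yearly cost in dollars of paying for every annotation task."""
    return articles * annotators * tasks * cost_per_task


cost = annual_cost(ARTICLES_PER_YEAR, ANNOTATORS_PER_ARTICLE, TASKS_PER_DOC, COST_PER_TASK)
print(f"${cost:,.0f} / year")
```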

Page 44: Open biomedical knowledge using crowdsourcing and citizen science

http://mark2cure.org

Page 45: Open biomedical knowledge using crowdsourcing and citizen science

Paid crowdsourcing ($$$)

• F = 0.87

• 9 days

• 145 workers

• Total: $630.96

Citizen Science (“Help science, please”)

• F = 0.84

• 28 days

• 212 workers

• Total cost: $0

Page 46: Open biomedical knowledge using crowdsourcing and citizen science

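Does Citizen Science scale?

\[
\text{volunteers needed}
= \frac{\text{number of annotation events per year}}{\text{number of annotation events per year per volunteer}}
= \frac{1{,}000{,}000 \text{ articles} \times 10 \text{ AE / article}}{\dfrac{10{,}275 \text{ AE} \times 365 \text{ days}}{212 \text{ annotators} \times 28 \text{ days}}}
\approx 15{,}828
\]

AE = Annotation events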

Page 47: Open biomedical knowledge using crowdsourcing and citizen science

Does Citizen Science scale?

15,828 volunteers needed

175,000 volunteers

300,000 volunteers

37,000 volunteers

1,000,000 volunteers

Page 48: Open biomedical knowledge using crowdsourcing and citizen science

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

therapeutic target

subject

predicate

object

GENE

DISEASE
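A relationship annotation of this kind can be stored as a subject-predicate-object statement plus a pointer to its source text, which is what makes the network traceable. The data structure below is a hypothetical sketch. The transcript does not show which disease mention the slide highlights, so the object in the example is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Concept:
    text: str       # the mention as it appears in the abstract
    category: str   # e.g. "GENE", "DISEASE"


@dataclass(frozen=True)
class Assertion:
    subject: Concept
    predicate: str
    obj: Concept
    source: str     # where the statement came from, for traceability


def index_by_subject(assertions: List[Assertion]) -> Dict[str, List[Assertion]]:
    """Group assertions by subject text so a concept's neighbors can be looked up."""
    index: Dict[str, List[Assertion]] = {}
    for assertion in assertions:
        index.setdefault(assertion.subject.text, []).append(assertion)
    return index


example = Assertion(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    obj=Concept("hematologic malignancies", "DISEASE"),  # assumed object, see lead-in
    source="abstract quoted on the slide (identifier not given)",
)

network = index_by_subject([example])
print(network["CDK9"][0].predicate)
```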

Page 49: Open biomedical knowledge using crowdsourcing and citizen science

Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

Page 50: Open biomedical knowledge using crowdsourcing and citizen science

Nina Hale https://flic.kr/p/zoVih

Page 51: Open biomedical knowledge using crowdsourcing and citizen science

Rare disease case study #1

Photo: Retta Beery

Page 52: Open biomedical knowledge using crowdsourcing and citizen science

Bainbridge et al., STM, 2011

Page 53: Open biomedical knowledge using crowdsourcing and citizen science

Photo: Retta Beery

Page 54: Open biomedical knowledge using crowdsourcing and citizen science

Rare disease case study #2

Page 55: Open biomedical knowledge using crowdsourcing and citizen science

Page 56: Open biomedical knowledge using crowdsourcing and citizen science

… but no obvious treatments

Page 57: Open biomedical knowledge using crowdsourcing and citizen science

Bainbridge et al., STM, 2011

SPR

Page 58: Open biomedical knowledge using crowdsourcing and citizen science

What differentiates SPR and NGLY1?

SPR

Page 59: Open biomedical knowledge using crowdsourcing and citizen science

Sarah Olmstead

https://flic.kr/p/364dZW

NGLY1

Page 60: Open biomedical knowledge using crowdsourcing and citizen science

NGLY1 (11 PubMed articles)

Congenital disorders of glycosylation (822)

PNGase (686)

ERAD (1330)

glycosylation (48,862)

alacrima (164)

Genetic interactors (3016)

symptoms (109,928)

24 million articles in PubMed

Page 61: Open biomedical knowledge using crowdsourcing and citizen science

Mapping the biomedical network around NGLY1

NGLY1

Page 62: Open biomedical knowledge using crowdsourcing and citizen science

Page 63: Open biomedical knowledge using crowdsourcing and citizen science

A preliminary view of the NGLY1-focused biological network

Page 64: Open biomedical knowledge using crowdsourcing and citizen science

Why do I Mark2Cure?

I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.

My 4 year old daughter Phoebe is living with and battling rare disease.

I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.

Take part in something that helps humanity.

I Mark2Cure in memory of my son Mike who had type 1 diabetes.

Studied biology in college and I really miss it!

In memory of my daughter who had Cystic Fibrosis

Give back

Page 65: Open biomedical knowledge using crowdsourcing and citizen science

Open biomedical knowledge

Free text to structured data

MyVariant.info MyGene.info

Integration of molecular biology databases via high performance APIs

Biomedical Linked Open Data

Page 66: Open biomedical knowledge using crowdsourcing and citizen science

Contact

http://sulab.org

[email protected]

@andrewsu

Gene Wiki / Wikidata

Ben Good

Sebastian Burgstaller

Tim Putman

Julia Turner

Ginger Tsueng

Andra Waagmeester

Elvira Mitraka, UMB

Lynn Schriml, UMB

Justin Leong, UBC

Paul Pavlidis, UBC

Join the team!

http://bit.ly/JoinSuLab

Slides: slideshare.net/andrewsu

Funding and Support

BioGPS: GM83924

Gene Wiki: GM089820

MyGene / MyVariant: HG008473

BD2K COE: GM114833

Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys

Other Group members

Jake Bruggemann

Ramya Gamini

Karthik Gangavarapu

Louis Gioia

Toby Li

Greg Stupp

MyGene / MyVariant

Chunlei Wu

Cyrus Afrasiabi

Kevin Xin

Adam Mark

Mark2Cure

Max Nanis

Ginger Tsueng

Jennifer Fouquier

Ben Good

Chunlei Wu

All Mark2Curators!