bioLogical mass collaboration Benjamin Good University of British Columbia Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center 12 March 2009.

DESCRIPTION

This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems, that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.

Page 1: Bio Logical Mass Collaboration3

bioLogical mass collaboration

Benjamin Good, University of British Columbia

Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center

12 March 2009.

Page 2: Bio Logical Mass Collaboration3

mass collaboration

- calling on a million minds...

Page 3: Bio Logical Mass Collaboration3

bioLogic

(diagram: statements in which a relation R links X to Y)

Page 4: Bio Logical Mass Collaboration3

The plan for today

Mostly-manual strategies for creating bioLogical knowledge

• pull

➡social tagging

• push

➡frames and games

Page 5: Bio Logical Mass Collaboration3

pull

1. incentive

• passive altruism: actions taken for individual gain result in collective benefit.

Page 6: Bio Logical Mass Collaboration3

pull

2. example

• hyperlinks: individual website authors did not intend to make Google possible...

Page 7: Bio Logical Mass Collaboration3

Social tagging

(image from Lund (2006) http://xtech06.usefulinc.com/schedule/paper/75)

Page 8: Bio Logical Mass Collaboration3

bioLogic captured

URI hasTag T

Page 9: Bio Logical Mass Collaboration3

More data captured

Tagging Event

Tagger: Jane

Associated Tags: hippocampus, mri, image, wikipedia

Resource Tagged: http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg

Tagging Context: 2007-8-29
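
The fields captured for a tagging event could be modelled as a small record. The following is a minimal sketch in Python; the names `TaggingEvent` and `tags_by_resource` are hypothetical and do not come from any tagging service, and treating the date as the tagging context is an assumption based on the slide layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class TaggingEvent:
    """One tagging event: who tagged which resource with which tags, and when."""
    tagger: str
    resource: str            # URI of the tagged resource
    tags: Tuple[str, ...]    # associated tags, as free-text strings
    tagged_on: date          # tagging context (assumed here to be the date)


def tags_by_resource(events):
    """Pool the tags applied to each resource across all taggers."""
    pooled = defaultdict(set)
    for event in events:
        pooled[event.resource].update(event.tags)
    return dict(pooled)


# The example values are the ones shown on the slide.
event = TaggingEvent(
    tagger="Jane",
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    tags=("hippocampus", "mri", "image", "wikipedia"),
    tagged_on=date(2007, 8, 29),
)
```

Pooling events by resource in this way is what makes tags from many individual users usable as shared descriptors of the same document.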

Page 10: Bio Logical Mass Collaboration3

Tags

• Not the same as either professionally or automatically generated keywords.

- (Al-Khalifa & Davis 2007)

• Can be used to improve Web search

- (Morrison 2008)

Page 11: Bio Logical Mass Collaboration3

Tagging in science?

• How does social tagging compare to professional indexing in the life sciences?

• (Good, Tennis, Wilkinson in preparation)

Page 12: Bio Logical Mass Collaboration3

“Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex”

Page 13: Bio Logical Mass Collaboration3

Growth of CiteULike: number of distinct PubMed documents tagged per month

(plot: N distinct PMIDs per month, 0 to 100,000, against date, 29-Oct-1999 to 19-Jun-2024; series: Citeulike Observed, Citeulike Extrapolated with 95% lower and upper bounds, MEDLINE, linear fits for MEDLINE and for Citeulike Extrapolated, extrapolated upper and lower bounds)

Page 14: Bio Logical Mass Collaboration3

but..

(two density plots, each with N tags on the x-axis, 0 to 29, and density on the y-axis, 0.0 to 0.5: "MeSH Descriptors per Pubmed Citation" and "Tags per Pubmed Citation: Citeulike Aggregate")

Page 15: Bio Logical Mass Collaboration3

because..

(two plots of N citations against N posts: "Posts per pubmed Citation: Connotea", with N posts from 0 to 30 and N citations up to 14,000, and "Posts per pubmed Citation: Citeulike", with N posts from 0 to 60 and N citations up to 10,000)

Page 16: Bio Logical Mass Collaboration3

open social tagging in science

➡low numbers of tags per post

➡low numbers of posts per document

➡low value of tags as descriptors..

Page 17: Bio Logical Mass Collaboration3

adding value to each tag

• social semantic tagging,

➡tagging with encoded concepts instead of strings of letters

➡ = the Entity Describer (E.D.)

Good, Kawas, Wilkinson (2007) Bridging the gap between social tagging and semantic annotation. Nature Precedings
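
The difference between tagging with strings and tagging with encoded concepts can be illustrated with a minimal Python sketch. The `Tag` class, the `same_meaning` helper and the example URI are all hypothetical; they are not the Entity Describer's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tag:
    label: str                         # what the user sees and types
    concept_uri: Optional[str] = None  # set only for semantic tags


def same_meaning(a: Tag, b: Tag) -> bool:
    """Semantic tags match on concept; plain tags can only match on spelling."""
    if a.concept_uri and b.concept_uri:
        return a.concept_uri == b.concept_uri
    return a.label.lower() == b.label.lower()


# The URI below is a made-up placeholder, not a real ontology identifier.
CONCEPT = "http://example.org/concept/apoptosis"
plain = Tag("apoptosis")
semantic_a = Tag("apoptosis", CONCEPT)
semantic_b = Tag("programmed cell death", CONCEPT)
```

Under these assumptions, `same_meaning(semantic_a, semantic_b)` holds even though the labels differ, whereas two plain string tags with different spellings cannot be recognised as synonyms.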

Page 18: Bio Logical Mass Collaboration3

Tagging with Connotea

Page 19: Bio Logical Mass Collaboration3

Typical tagging

User types in all tags

Type-ahead displays previously used tags

Page 20: Bio Logical Mass Collaboration3

Tagging with E.D.

Page 21: Bio Logical Mass Collaboration3

Adding a semantic tag

Page 22: Bio Logical Mass Collaboration3

Adding a semantic tag

Page 23: Bio Logical Mass Collaboration3

More data captured for each tag

Page 24: Bio Logical Mass Collaboration3

E.D. can be customized

• Tag with:

genes, gene ontology terms, terms from OWL ontologies

• Recently used to conduct a successful experiment in BioMoby Web service annotation

Page 25: Bio Logical Mass Collaboration3

but!

• Does not address the volume problem - more participation is needed to make social tagging a useful source of bioLogical knowledge.

Page 26: Bio Logical Mass Collaboration3

The plan for today

Mostly-manual strategies for creating bioLogical knowledge

• pull

➡social tagging

• push

➡frames and games

Page 27: Bio Logical Mass Collaboration3

push

• Key difference from pull model is that system designers push specific requests to users

• many incentive options:

financial, psychological...

Page 28: Bio Logical Mass Collaboration3

Pushy pattern

1. design frame for knowledge to be collected

2. choose incentive system

3. design interface

4. collect knowledge

5. aggregate knowledge

Page 29: Bio Logical Mass Collaboration3

Mechanical Turk: pushing with money

• A “marketplace for work” hosted by Amazon Inc., billed as “artificial artificial intelligence”

Page 30: Bio Logical Mass Collaboration3

Mechanical Turk and NLP

• Snow et al (2008)

- used workers on the AMT to label text for use in training/testing NLP algorithms.

- word sense disambiguation, affect recognition and several more.

Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, In Empirical Methods in Natural Language Processing, p 254--263

Page 31: Bio Logical Mass Collaboration3

Snow et al (2008) cont.

Results for affect recognition

• labels = 7000

• cost = $2

• time = 5.9 hours

• when aggregated, results were equal to or better than those of expert labelers in most cases.

Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, In Empirical Methods in Natural Language Processing, p 254--263
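
One common way to aggregate redundant categorical labels from several non-expert workers is a simple majority vote. The sketch below illustrates that idea only; it is not claimed to be the procedure used by Snow et al., and the function names and toy data are invented.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None when the top two are tied."""
    counts = Counter(labels).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def aggregate(annotations):
    """Map each item to its majority label.

    `annotations` maps an item id to the list of labels given by workers.
    """
    return {item: majority_label(labels) for item, labels in annotations.items()}


# Invented toy data, not taken from Snow et al.
votes = {
    "headline-1": ["joy", "joy", "surprise"],
    "headline-2": ["fear", "anger"],
}
```

Returning `None` for ties keeps undecided items visible, so that more labels can be requested for them rather than picking a winner arbitrarily.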

Page 32: Bio Logical Mass Collaboration3

ESP game, pushing with fun

Von Ahn and Dabbish (2004) Labeling Images with a Computer Game http://www.cs.cmu.edu/~biglou/ESP.pdf

Page 33: Bio Logical Mass Collaboration3

ESP game results (2004)

• >4 million images labeled

• >23,000 players

• Given 5,000 players online simultaneously, could label all of the images accessible to Google in a month

• (See the “Google image labeling game”…)

Page 34: Bio Logical Mass Collaboration3

iCAPTURer: assessing push for bioLogical knowledge

• Can we acquire bio-ontological knowledge from untrained volunteers in a scalable, Web-based manner?

• 2 experiments in the context of scientific conferences

Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.

Good and Wilkinson 2007. Ontology engineering using volunteer labor.

Page 35: Bio Logical Mass Collaboration3

iCAPTURer 1

Goals

1. Identify concepts from text

2. Link concepts to synonyms and to hyponyms (‘x is_a y’) rooted in the UMLS Semantic Network

Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.

Page 36: Bio Logical Mass Collaboration3

iCAPTURer 1 - terminology builder

Abstracts → Automatic term extraction (Text2Onto) → Candidate terms → Volunteers filter terms and extend terminology → Validated terms

Example terms shown: Taste bar, Cell foo, Cell biology queen, Glucose cell, apoptosis, immune response, smooth muscle cell

Page 37: Bio Logical Mass Collaboration3

iCAPTURer 1 - taxonomy builder

Volunteers assign parents

Validated terms: smooth muscle cell, T-cell activation, apoptosis

UMLS Semantic Network: Entity, Physical_Object, Conceptual_Entity, Event, Activity, Process, Generic Concept

Page 38: Bio Logical Mass Collaboration3

iCAPTURer 1 - taxonomy builder

Validated terms: smooth muscle cell, T-cell activation, apoptosis

UMLS Semantic Network: Entity, Physical_Object, Conceptual_Entity, Event, Activity, Process, Generic Concept
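
Assigning a parent to each validated term yields a set of is_a links that can be stored as a simple child-to-parent map. The sketch below is illustrative only: the category names come from the slide, but the links between them and the parent assignments for the three terms are guesses rather than recorded volunteer data.

```python
def ancestors(term, parent_of):
    """Follow is_a links from `term` up to the root, returning the path."""
    path = []
    seen = {term}
    while term in parent_of:
        term = parent_of[term]
        if term in seen:  # guard against accidental cycles in volunteer input
            raise ValueError(f"cycle detected at {term!r}")
        seen.add(term)
        path.append(term)
    return path


# Illustrative links only; not the actual iCAPTURer results.
parent_of = {
    "Physical_Object": "Entity",
    "Conceptual_Entity": "Entity",
    "Activity": "Event",
    "Process": "Event",
    "smooth muscle cell": "Physical_Object",
    "apoptosis": "Process",
    "T-cell activation": "Process",
}
```

With this map, `ancestors("apoptosis", parent_of)` walks from the term through `Process` to `Event`, which is the chain a volunteer builds by choosing a single parent.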

Page 39: Bio Logical Mass Collaboration3

iCAPTURer 1 results regarding volunteers

• Recruiting went surprisingly well.

• Volume of contributions highly skewed - a few did most of the work

Page 40: Bio Logical Mass Collaboration3

Participation curve

(plot: percent of total knowledge added, 0 to 0.14 on the y-axis, for each volunteer, 1 to 64 on the x-axis; the values 12 and 7 are annotated on the plot)
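
A participation curve of this kind can be computed from per-volunteer contribution counts. The following is a minimal sketch with invented counts; the function names are hypothetical and the numbers are not the iCAPTURer data.

```python
def contribution_shares(counts):
    """Return each volunteer's fraction of all contributions, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(volunteer, n / total) for volunteer, n in ranked]


def top_share(counts, k):
    """Fraction of all contributions made by the k most active volunteers."""
    return sum(share for _, share in contribution_shares(counts)[:k])


# Invented counts for illustration only.
counts = {"v1": 60, "v2": 25, "v3": 10, "v4": 5}
```

A highly skewed curve shows up as a large `top_share` for a small `k`, which is the "a few did most of the work" pattern.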

Page 41: Bio Logical Mass Collaboration3
Page 42: Bio Logical Mass Collaboration3

knowledge gathered

1) Collection: 2 days, 68 participants

Terms: 232 auto. + 429 man. = 661

Hyponyms: 207

Synonyms: 340

2) Evaluation: 3 days, 65 participants, 11,545 votes

93% true > false, 49% true > false, 54% true > false

(three plots, each with the fraction of "true" votes, 0 to 1, on the y-axis: A, terms sorted by fraction of "true" votes; B, synonyms sorted by fraction of "true" votes; C, hyponyms sorted by fraction of "true" votes)

Page 44: Bio Logical Mass Collaboration3

Knowledge capture at YI forum

Evaluation conducted via email request

Number of assertions gathered

11,000

1,000

Initial acquisition versus evaluation

Page 45: Bio Logical Mass Collaboration3

Initial acquisition versus evaluation: number of assertions gathered

Knowledge capture at YI forum: 1,000

• Forms

• Tree navigation

• Conference setting

• 2 days

• 65 people

“I assert that t cell activation is a kind of immune response”

Evaluation conducted via email request: 11,000

• Multiple choice (voting)

• Home setting

• 3 days

• 68 people

“I agree that t cell activation is a kind of immune response”

Page 46: Bio Logical Mass Collaboration3

iCAPTURer 2 pattern

1. Infer complete ontology

2. Present each edge as a multiple choice question {true, false, I don’t know}

3. Aggregate votes to decide on each triple
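
The three steps can be sketched in Python as phrasing each candidate edge as a question and then deciding each triple from the collected votes. This is an illustration under stated assumptions: the function names, the vote strings and the `min_votes` threshold are hypothetical and are not taken from iCAPTURer 2.

```python
from collections import Counter


def question(child, parent):
    """Phrase one candidate subClassOf edge as a multiple-choice prompt."""
    return (f"True or false: '{child}' is a kind of '{parent}'. "
            "(true / false / I don't know)")


def decide(votes, min_votes=3):
    """Decide one candidate triple from votes in {"true", "false", "unknown"}.

    "unknown" votes are ignored; the triple stays undecided when too few
    informative votes remain or when they are evenly split.
    """
    counts = Counter(v for v in votes if v in ("true", "false"))
    informative = counts["true"] + counts["false"]
    if informative < min_votes or counts["true"] == counts["false"]:
        return "undecided"
    return "accepted" if counts["true"] > counts["false"] else "rejected"


# Invented votes for one candidate edge.
edge = ("T-cell activation", "immune response")
edge_votes = ["true", "true", "unknown", "true"]
```

Keeping an explicit "undecided" outcome means that sparse or split votes never silently turn into an accepted or rejected triple.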

Page 47: Bio Logical Mass Collaboration3

iCAPTURer 2 knowledge sought

(diagram: X subClassOf Y, with both X and Y marked “?”)

(immunology)

Page 48: Bio Logical Mass Collaboration3

iCAPTURer2 results

• Same pattern of participation

• Only 66% correct overall in assessing subClass assertions

• highly biased towards saying ‘yes’.

(plot: fraction subclass judgments made, 0 to 1.2 on the y-axis, for each volunteer, 1 to 25 on the x-axis)

Page 49: Bio Logical Mass Collaboration3

iCAPTURer summary

• Scientifically relevant tasks are harder: the pool of potential contributors is smaller but, in my experience, generally very willing.

• Engaging the competitive instinct was helpful in obtaining the responses we did.

• Much room for further investigation.

Page 50: Bio Logical Mass Collaboration3

Small steps

• but apparently in a promising direction

Page 51: Bio Logical Mass Collaboration3

Filling in Freebase with Typewriter

http://typewriter.freebaseapps.com/ March 9, 2009

(diagram: X is a Y, with both X and Y marked “?”)

Page 53: Bio Logical Mass Collaboration3

(diagram: many statements in which a relation R links X to Y)

To achieve mass collaborative bioLogical knowledge assembly, make it possible for people to contribute in multiple modes

- as creators

- as evaluators

- as system builders (open APIs are crucial)

and for multiple reasons

- personal information management

- fun, competition

- finance

Page 54: Bio Logical Mass Collaboration3

“...how you envision future developments...”

Automation

Page 55: Bio Logical Mass Collaboration3

“...how you envision future developments...”

Automation + Human computation

Page 56: Bio Logical Mass Collaboration3

“...how you envision future developments...”

Automation + Human computation = increasingly high-throughput bioLogical knowledge representation

Page 57: Bio Logical Mass Collaboration3

“...how your own expertise would fit into this realm...”

(diagram: "ben" knows a bit about knowledge representation, machine learning and community action; "more bioLogical analyses" requires knowledge representation, machine learning and community action)

http://biordf.net/~bgood/

Page 58: Bio Logical Mass Collaboration3

Thanks to

• developers: Eddie Kawas, Paul Lu

• advisor: Mark Wilkinson

• Barend Mons for the invitation and Marco Roos for the accommodation!

http://biordf.net/~bgood/