Bio Logical Mass Collaboration3

bioLogical mass collaboration

Benjamin Good, University of British Columbia

Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center

12 March 2009.

mass collaboration

- calling on a million minds...

bioLogic

(diagram: X and Y linked by a relation R)

The plan for today

Mostly-manual strategies for creating bioLogical knowledge

• pull

➡social tagging

• push

➡frames and games

pull

1. incentive

• passive altruism: actions taken for individual gain result in collective benefit.

pull

2. example

• hyperlinks: individual website authors did not intend to make Google possible...

Social tagging

(image from Lund (2006) http://xtech06.usefulinc.com/schedule/paper/75)


bioLogic captured

URI hasTag T

More data captured

Tagging Event

• Tagger: JaneTagger

• Associated Tags: hippocampus, mri, image, wikipedia

• Resource Tagged: http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg

• Tagging Context: 2007-8-29
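A minimal sketch of how the tagging event on this slide could be held as a record and flattened into "URI hasTag T" statements. The class name `TaggingEvent`, its field names, and the helper `to_triples` are assumptions made for illustration; they are not taken from Connotea, Citeulike, or the Entity Describer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TaggingEvent:
    """One act of tagging: who tagged what, with which tags, and when."""
    tagger: str
    resource: str            # URI of the tagged resource
    tags: tuple[str, ...]    # all tags applied together in this event
    tagged_on: date          # the tagging context shown on the slide


def to_triples(event: TaggingEvent) -> list[tuple[str, str, str]]:
    """Flatten an event into (resource, 'hasTag', tag) statements."""
    return [(event.resource, "hasTag", tag) for tag in event.tags]


# Values copied from the slide's example.
event = TaggingEvent(
    tagger="JaneTagger",
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    tags=("hippocampus", "mri", "image", "wikipedia"),
    tagged_on=date(2007, 8, 29),
)
triples = to_triples(event)
```

Keeping the whole event, rather than only the triples, preserves the tagger, the date, and which tags were applied together.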

Tags

• Not the same as either professionally or automatically generated keywords.

- (Al-Khalifa & Davis 2007)

• Can be used to improve Web search

- (Morrison 2008)

Tagging in science?

• How does social tagging compare to professional indexing in the life sciences?

• (Good, Tennis, Wilkinson in preparation)

“Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex”

growth of Citeulike

(figure: number of distinct Pubmed documents (PMIDs) tagged per month, plotted by date; series: Citeulike observed, Citeulike extrapolated with 95% lower and upper bounds, MEDLINE, linear fits for MEDLINE and for Citeulike extrapolated, extrapolated upper and lower bounds)

but..

MeSH Descriptors per Pubmed Citation

(figure: density histogram; x-axis: N tags, 0 to 29; y-axis: density, 0.0 to 0.5)

Tags per Pubmed Citation: Citeulike Aggregate

(figure: density histogram; x-axis: N tags, 0 to 29; y-axis: density, 0.0 to 0.5)

because..

Posts per Pubmed Citation: Connotea

(figure: x-axis: N posts, 0 to 30; y-axis: N citations, 0 to 14000)

Posts per Pubmed Citation: Citeulike

(figure: x-axis: N posts, 0 to 60; y-axis: N citations, 0 to 10000)

open social tagging in science

➡low numbers of tags per post

➡low numbers of posts per document

➡low value of tags as descriptors..
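The quantities behind these observations (tags per post, posts per document, distinct tags per document) can be computed from a list of posts. The sketch below uses made-up posts and hypothetical function names; it is not the analysis code used for the figures.

```python
from collections import Counter

# Hypothetical posts: (user, document id, tags). Illustrative values only.
posts = [
    ("u1", "pmid:1", ["astrocyte", "imaging"]),
    ("u2", "pmid:1", ["visual-cortex"]),
    ("u1", "pmid:2", ["mri"]),
    ("u3", "pmid:3", []),
]


def tags_per_post(posts):
    """Number of tags attached in each individual post."""
    return [len(tags) for _, _, tags in posts]


def posts_per_document(posts):
    """How many separate posts refer to each document."""
    return Counter(doc for _, doc, _ in posts)


def distinct_tags_per_document(posts):
    """Size of the aggregate tag set for each document across all users."""
    aggregate = {}
    for _, doc, tags in posts:
        aggregate.setdefault(doc, set()).update(tags)
    return {doc: len(tag_set) for doc, tag_set in aggregate.items()}
```

The aggregate count is the one comparable to MeSH descriptors per citation, since a document's tag set only grows when several people post it.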

adding value to each tag

• social semantic tagging,

➡tagging with encoded concepts instead of strings of letters

➡ = the Entity Describer (E.D.)

Good, Kawas, Wilkinson (2007) Bridging the gap between social tagging and semantic annotation. Nature Precedings

Tagging with Connotea

Typical tagging

• User types in all tags

• Type-ahead displays previously used tags

Tagging with E.D.

Adding a semantic tag

More data captured for each tag

E.D. can be customized

• Tag with:

genes, gene ontology terms, terms from OWL ontologies

• Recently used to conduct a successful experiment in BioMoby Web service annotation

but!

• Does not address the volume problem - more participation is needed to make social tagging a useful source of bioLogical knowledge.

The plan for today

Mostly-manual strategies for creating bioLogical knowledge

• pull

➡social tagging

• push

➡frames and games

push

• Key difference from pull model is that system designers push specific requests to users

• many incentive options:

financial, psychological...

Pushy pattern

1. design frame for knowledge to be collected

2. choose incentive system

3. design interface

4. collect knowledge

5. aggregate knowledge


Mechanical Turk: pushing with money

• A “marketplace for work” hosted by Amazon, described as “artificial artificial intelligence”

Mechanical Turk and NLP

• Snow et al (2008)

- used workers on Amazon Mechanical Turk (AMT) to label text for use in training and testing NLP algorithms.

- tasks included word sense disambiguation, affect recognition and several more.

Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, In Empirical Methods in Natural Language Processing, p 254--263

Snow et al (2008) cont.

Results for affect recognition

• labels = 7000

• cost = $2

• time = 5.9 hours

• when aggregated, the non-expert labels were as good as or better than those of expert labelers in most cases.
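One simple way to aggregate redundant non-expert labels is to average the numeric ratings that several workers give the same item. The sketch below shows that idea on made-up ratings. The function name and data layout are assumptions, and the slide does not state the exact aggregation procedure used in the study.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(ratings):
    """Average the scores given to each item by different workers.

    ratings: iterable of (item_id, worker_id, score) tuples.
    Returns a dict mapping item_id to the mean score.
    """
    by_item = defaultdict(list)
    for item, _worker, score in ratings:
        by_item[item].append(score)
    return {item: mean(scores) for item, scores in by_item.items()}


# Hypothetical ratings (e.g. how strongly a headline expresses an emotion).
ratings = [
    ("h1", "w1", 80), ("h1", "w2", 60), ("h1", "w3", 70),
    ("h2", "w1", 10), ("h2", "w2", 30),
]
consensus = aggregate_ratings(ratings)
```

Averaging lets individual workers' errors partly cancel, which is one reason several cheap labels can compete with a single expert label.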


ESP game, pushing with fun

Von Ahn and Dabbish (2004) Labeling Images with a Computer Game http://www.cs.cmu.edu/~biglou/ESP.pdf


ESP game results (2004)

• >4 million images labeled

• >23,000 players

• Given 5,000 players online simultaneously, could label all of the images accessible to Google in a month

• (See the “Google image labeling game”…)

iCAPTURer: assessing push for bioLogical knowledge

• Can we acquire bio-ontological knowledge from untrained volunteers in a scalable, Web-based manner?

• 2 experiments in the context of scientific conferences

Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.

Good and Wilkinson 2007. Ontology engineering using volunteer labor.

iCAPTURer 1

Goals

1. Identify concepts from text

2. Link concepts to synonyms and to hyponyms (‘x is_a y’) rooted in the UMLS Semantic Network


iCAPTURer 1 - terminology builder

(diagram: Abstracts ➡ automatic term extraction with Text2Onto ➡ candidate terms ➡ volunteers filter terms and extend terminology ➡ validated terms; example terms shown: taste bar, cell foo, cell biology queen, glucose cell, apoptosis, immune response, smooth muscle cell)

iCAPTURer 1 - taxonomy builder

Volunteers assign parents

(diagram: validated terms (smooth muscle cell, T-cell activation, apoptosis) are attached to nodes of the UMLS Semantic Network; nodes shown: Entity, Physical_Object, Event, Process, Generic Concept, Conceptual_Entity, Activity)


iCAPTURer 1 results regarding volunteers

• Recruiting went surprisingly well.

• Volume of contributions highly skewed - a few did most of the work

Participation curve

(figure: x-axis: volunteer, 1 to 64; y-axis: percent of total knowledge added, 0 to 0.14)

knowledge gathered

1) Collection: 2 days, 68 participants

• Terms: 232 auto. + 429 man. = 661

• Hyponyms: 207

• Synonyms: 340

2) Evaluation: 3 days, 65 participants, 11,545 votes

• 93% true > false, 49% true > false, 54% true > false (one value per panel)

(figure, three panels, each sorted by fraction of “true” votes on a scale of 0 to 1: A: Terms; B: Synonyms; C: Hyponyms)


Initial acquisition versus evaluation

Knowledge capture at YI forum

• Forms

• Tree navigation

• Conference setting

• 2 days

• 65 people

• “I assert that t cell activation is a kind of immune response”

Evaluation conducted via email request

• Multiple choice (voting)

• Home setting

• 3 days

• 68 people

• “I agree that t cell activation is a kind of immune response”

(figure: number of assertions gathered in each phase; values shown: 11,000 and 1,000)

iCAPTURer 2 pattern

1. Infer complete ontology

2. Present each edge as a multiple choice question {true, false, I don’t know}

3. Aggregate votes to decide on each triple
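Step 3 can be sketched as a per-triple vote count in which “I don’t know” answers are ignored. The threshold `min_votes`, the outcome labels, and the example triples and votes below are assumptions for illustration, not the rule or data used in iCAPTURer 2.

```python
from collections import Counter


def decide_triple(votes, min_votes=3):
    """Decide one triple from its votes.

    votes: list of 'true', 'false', or 'unknown' strings.
    Returns 'accept', 'reject', or 'undecided'.
    """
    # Only 'true' and 'false' carry information; 'unknown' is dropped.
    counts = Counter(v for v in votes if v in ("true", "false"))
    informative = counts["true"] + counts["false"]
    if informative < min_votes or counts["true"] == counts["false"]:
        return "undecided"
    return "accept" if counts["true"] > counts["false"] else "reject"


# Hypothetical votes on candidate subClassOf edges.
votes_by_triple = {
    ("T-cell activation", "subClassOf", "immune response"):
        ["true", "true", "unknown", "true"],
    ("apoptosis", "subClassOf", "immune response"):
        ["false", "true", "false", "false"],
    ("smooth muscle cell", "subClassOf", "immune response"):
        ["unknown", "true"],
}
decisions = {triple: decide_triple(v) for triple, v in votes_by_triple.items()}
```

An explicit "undecided" outcome keeps edges with too few informative votes out of the ontology until more people have answered.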

iCAPTURer 2 knowledge sought

X subClassOf Y?

(immunology)

iCAPTURer 2 results

• Same pattern of participation

• Only 66% correct overall in assessing subClass assertions

• Highly biased towards saying ‘yes’.

(figure: x-axis: volunteer, 1 to 25; y-axis: fraction of subclass judgments made, 0 to 1.2)
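A bias towards ‘yes’ can be made visible by comparing, for each volunteer, how often they answered ‘yes’ with how often ‘yes’ was the correct answer. The sketch below assumes a gold standard is available for every judged assertion. The function name and the sample judgments are hypothetical.

```python
def judge_stats(judgments):
    """Summarise one volunteer's judgments.

    judgments: list of (vote, truth) pairs of booleans, where vote is the
    volunteer's answer and truth is the gold-standard answer.
    Returns (accuracy, yes_rate, yes_bias).
    """
    n = len(judgments)
    accuracy = sum(vote == truth for vote, truth in judgments) / n
    yes_rate = sum(vote for vote, _ in judgments) / n
    true_rate = sum(truth for _, truth in judgments) / n
    # Positive bias: the volunteer says 'yes' more often than is warranted.
    return accuracy, yes_rate, yes_rate - true_rate


# Hypothetical judgments for a single volunteer.
sample = [
    (True, True), (True, False), (True, True),
    (False, False), (True, False), (True, True),
]
accuracy, yes_rate, yes_bias = judge_stats(sample)
```

Here the volunteer is right on 4 of 6 assertions but says ‘yes’ to 5 of 6 when only 3 are true, so the errors are concentrated in false acceptances.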

iCAPTURer summary

• Scientifically relevant tasks are harder: the pool of potential contributors is smaller but, in my experience, generally very willing.

• Engaging the competitive instinct was helpful in obtaining the responses we did.

• Much room for further investigation.

Small steps

• but apparently in a promising direction

Filling in Freebase with Typewriter

http://typewriter.freebaseapps.com/ March 9, 2009

X is a Y?






To achieve mass collaborative bioLogical knowledge assembly, make it possible for people to contribute in multiple modes

- as creators

- as evaluators

- as system builders (open APIs are crucial)

and for multiple reasons

- personal information management

- fun, competition

- finance

“...how you envision future developments...”

Automation + Human computation

= increasingly high-throughput bioLogical knowledge representation

“...how your own expertise would fit into this realm...”

(diagram: ben “knows a bit about” knowledge representation, machine learning and community action; more bioLogical analyses “requires” them)

http://biordf.net/~bgood/

http://dev.biordf.net/~bgood/


Thanks to

• developers: Eddie Kawas, Paul Lu

• advisor: Mark Wilkinson

• Barend Mons for the invitation and Marco Roos for the accommodation!

http://biordf.net/~bgood/


