This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems, that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.
bioLogical mass collaboration
Benjamin Good
University of British Columbia
Symposium on (Bio)semantics for complex systems biology, Leiden University Medical Center
12 March 2009.
mass collaboration
- calling on a million minds...
bioLogic
[Diagram: statements of the form X R Y]
The plan for today
Mostly-manual strategies for creating bioLogical knowledge
• pull
➡social tagging
• push
➡frames and games
pull
1. incentive
• passive altruism: actions taken for individual gain result in collective benefit.
pull
2. example
• hyperlinks: individual website authors did not intend to make Google possible...
Social tagging
(image from Lund (2006) http://xtech06.usefulinc.com/schedule/paper/75)
bioLogic captured
URI hasTag T
More data captured
Tagging Event
• Tagger: Jane
• Associated Tags: hippocampus mri image wikipedia
• Resource Tagged: http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg
• Tagging Context: 2007-8-29
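One way to picture the data captured per tagging event is as a small record that expands into one hasTag statement per tag. The sketch below is only an illustration: the class and field names are hypothetical, and only the example values come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TaggingEvent:
    """One act of tagging; field names are illustrative, not from any real API."""
    tagger: str
    resource: str          # URI of the thing being tagged
    tags: Tuple[str, ...]  # free-text tags applied in this event
    date: str              # date string as shown on the slide

    def triples(self) -> List[Tuple[str, str, str]]:
        """Expand the event into (URI, 'hasTag', tag) statements."""
        return [(self.resource, "hasTag", tag) for tag in self.tags]


event = TaggingEvent(
    tagger="Jane",
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    tags=("hippocampus", "mri", "image", "wikipedia"),
    date="2007-8-29",
)

for subject, predicate, obj in event.triples():
    print(subject, predicate, obj)
```

Keeping the tagger and date on the event, rather than on each triple, is a design choice of this sketch; it keeps the simple hasTag statements separate from the richer context.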
Tags
• Not the same as either professionally or automatically generated keywords.
- (Al-Khalifa & Davis 2007)
• Can be used to improve Web search
- (Morrison 2008)
Tagging in science?
• How does social tagging compare to professional indexing in the life sciences?
• (Good, Tennis, Wilkinson in preparation)
“Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex”
Growth of Citeulike: number of distinct PubMed documents tagged per month
[Figure: distinct PMIDs per month (pmids/month) plotted against date, 1999 to 2024. Series: Citeulike observed; Citeulike extrapolated, with 95% lower and upper bounds; MEDLINE; linear fits to MEDLINE and to the Citeulike extrapolation, with extrapolated upper and lower bounds.]
but..
[Figure: MeSH descriptors per PubMed citation; x-axis: N tags (0 to 29); y-axis: density (0.0 to 0.5)]
[Figure: Tags per PubMed citation, Citeulike aggregate; x-axis: N tags (0 to 29); y-axis: density (0.0 to 0.5)]
because..
[Figure: Posts per PubMed citation, Connotea; x-axis: N posts (0 to 30); y-axis: N citations (0 to 14000)]
[Figure: Posts per PubMed citation, Citeulike; x-axis: N posts (0 to 60); y-axis: N citations (0 to 10000)]
open social tagging in science
➡low numbers of tags per post
➡low numbers of posts per document
➡low value of tags as descriptors..
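The two quantities behind this summary, tags per post and posts per document, can be computed from a list of posts as in the minimal sketch below. The post records are made up for illustration and do not come from Citeulike or Connotea.

```python
from collections import Counter
from statistics import mean

# Hypothetical posts: (user, document id, tags). Values are invented examples.
posts = [
    ("user_a", "pmid:1", ["astrocyte", "imaging"]),
    ("user_b", "pmid:1", ["visual-cortex"]),
    ("user_a", "pmid:2", ["toread"]),
    ("user_c", "pmid:3", []),
]

# Number of tags attached to each individual post.
tags_per_post = [len(tags) for _, _, tags in posts]

# Number of posts each distinct document received.
posts_per_document = Counter(doc for _, doc, _ in posts)

# How many documents received 1 post, 2 posts, and so on.
distribution = Counter(posts_per_document.values())

print("mean tags per post:", mean(tags_per_post))
print("posts per document:", dict(posts_per_document))
print("documents by post count:", dict(distribution))
```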
adding value to each tag
• social semantic tagging,
➡tagging with encoded concepts instead of strings of letters
➡ = the Entity Describer (E.D.)
Good, Kawas, Wilkinson (2007) Bridging the gap between social tagging and semantic annotation. Nature Precedings
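The difference between a string tag and a concept-encoded tag can be sketched as follows. This is not the Entity Describer's implementation; the `Tag` class and the concept URI are hypothetical and serve only to show why encoded concepts let differently spelled tags be grouped.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tag:
    label: str                         # what the user typed or selected
    concept_uri: Optional[str] = None  # set only for semantic tags


# Hypothetical concept identifier; not a real ontology URI.
T_CELL = "http://example.org/concept/T_cell"

tags = [
    Tag("t cell", T_CELL),
    Tag("T-lymphocyte", T_CELL),  # different string, same concept
    Tag("tcell"),                 # plain string tag, meaning not encoded
]

# Group labels by concept; plain tags can only be grouped by their exact string.
by_concept = defaultdict(list)
for tag in tags:
    key = tag.concept_uri or f"string:{tag.label}"
    by_concept[key].append(tag.label)

for key, labels in by_concept.items():
    print(key, labels)
```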
Tagging with Connotea
Typical tagging
User types in all tags
Type-ahead displays previously used tags
Tagging with E.D.
Adding a semantic tag
More data captured for each tag
E.D. can be customized
• Tag with:
genes, gene ontology terms, terms from OWL ontologies
• Recently used to conduct a successful experiment in BioMoby Web service annotation
but!
• Does not address the volume problem - more participation is needed to make social tagging a useful source of bioLogical knowledge.
The plan for today
Mostly-manual strategies for creating bioLogical knowledge
• pull
➡social tagging
• push
➡frames and games
push
• Key difference from pull model is that system designers push specific requests to users
• many incentive options:
financial, psychological...
Pushy pattern
1. design frame for knowledge to be collected
2. choose incentive system
3. design interface
4. collect knowledge
5. aggregate knowledge
Mechanical Turk: pushing with money
• A “marketplace for work” hosted by Amazon Inc. “artificial artificial intelligence”
Mechanical Turk and NLP
• Snow et al (2008)
- used workers on the AMT to label text for use in training/testing NLP algorithms.
- word sense disambiguation, affect recognition and several more.
Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks, In Empirical Methods in Natural Language Processing, p 254--263
Snow et al (2008) cont.
Results for affect recognition
• labels = 7000
• cost = $2
• time = 5.9 hours
• when aggregated, results equal or better than expert labelers in most cases.
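The throughput implied by these figures is simple arithmetic, shown here as a short sketch that uses only the numbers quoted on the slide.

```python
# Figures as quoted on the slide for the affect recognition task.
labels = 7000
cost_usd = 2.0
hours = 5.9

labels_per_dollar = labels / cost_usd
labels_per_hour = labels / hours

print(f"labels per dollar: {labels_per_dollar:.0f}")
print(f"labels per hour:   {labels_per_hour:.0f}")
```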
ESP game, pushing with fun
Von Ahn and Dabbish (2004) Labeling Images with a Computer Game http://www.cs.cmu.edu/~biglou/ESP.pdf
ESP game results (2004)
• >4 million images labeled
• >23,000 players
• Given 5,000 players online simultaneously, could label all of the images accessible to Google in a month
• (See the “Google image labeling game”…)
iCAPTURer: assessing push for bioLogical knowledge
• Can we acquire bio-ontological knowledge from untrained volunteers in a scalable, Web-based manner?
• 2 experiments in the context of scientific conferences
Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
Good and Wilkinson 2007. Ontology engineering using volunteer labor.
iCAPTURer 1 goals
1. Identify concepts from text
2. Link concepts to synonyms and to hyponyms (‘x is_a y’) rooted in the UMLS Semantic Network
iCAPTURer 1 - terminology builder
Abstracts ➡ automatic term extraction (Text2Onto) ➡ candidate terms ➡ volunteers filter terms and extend terminology ➡ validated terms
Example terms shown: taste bar, cell foo, cell biology queen, glucose cell, apoptosis, immune response, smooth muscle cell
iCAPTURer 1 - taxonomy builder
Volunteers assign parents
• Validated terms: smooth muscle cell, T-cell activation, apoptosis
• UMLS Semantic Network: Entity, Physical_Object, Event, Process, Generic Concept, Conceptual_Entity, Activity
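Assigning parents amounts to recording child-to-parent is_a links that can be followed up to a root of the upper-level network. The sketch below assumes a plain dictionary of links; the specific parent assignments are illustrative guesses, not the volunteers' actual choices and not the real UMLS hierarchy.

```python
from typing import Dict, List

# Child -> parent ('is_a') links. These assignments are illustrative only.
parent_of: Dict[str, str] = {
    "smooth muscle cell": "Physical_Object",
    "apoptosis": "Process",
    "T-cell activation": "Process",
    "Physical_Object": "Entity",
    "Process": "Event",
}


def path_to_root(term: str, parents: Dict[str, str]) -> List[str]:
    """Follow is_a links upward until a term with no parent is reached."""
    path = [term]
    seen = {term}
    while path[-1] in parents:
        nxt = parents[path[-1]]
        if nxt in seen:  # guard against accidental cycles in volunteer input
            raise ValueError(f"cycle detected at {nxt!r}")
        path.append(nxt)
        seen.add(nxt)
    return path


print(" is_a ".join(path_to_root("apoptosis", parent_of)))
```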
iCAPTURer 1 results regarding volunteers
• Recruiting went surprisingly well.
• Volume of contributions highly skewed - a few did most of the work
[Figure: Participation curve; x-axis: volunteer (1 to 64); y-axis: percent of total knowledge added (0 to 0.14)]
knowledge gathered
1) Collection: 2 days, 68 participants
• Terms: 232 auto. + 429 man. = 661
• Hyponyms: 207
• Synonyms: 340
2) Evaluation: 3 days, 65 participants, 11,545 votes
• Terms: 93% true > false
• Hyponyms: 49% true > false
• Synonyms: 54% true > false
[Figure, three panels: A: Terms sorted by fraction "true" votes; B: Synonyms sorted by fraction "true" votes; C: Hyponyms sorted by fraction "true" votes. y-axis: fraction of "true" votes (0 to 1)]
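A "true > false" percentage of this kind can be derived from per-item vote tallies, and the same tallies give the sort order used in the plots. The sketch below uses invented tallies; only two of the term names echo examples from the slides.

```python
from typing import Dict, Tuple

# Hypothetical vote tallies per item: (true votes, false votes).
votes: Dict[str, Tuple[int, int]] = {
    "immune response": (9, 1),
    "smooth muscle cell": (7, 2),
    "cell foo": (1, 8),
    "glucose cell": (3, 3),
}


def fraction_true(tally: Tuple[int, int]) -> float:
    """Fraction of 'true' votes among all votes cast for one item."""
    true_votes, false_votes = tally
    total = true_votes + false_votes
    return true_votes / total if total else 0.0


# Order items as in the plots: highest fraction of "true" votes first.
ranked = sorted(votes, key=lambda item: fraction_true(votes[item]), reverse=True)

# Share of items whose "true" votes strictly outnumber their "false" votes.
majority_true = [item for item, (t, f) in votes.items() if t > f]
share = len(majority_true) / len(votes)

print(ranked)
print(f"{share:.0%} true > false")
```

Ties count as not passing in this sketch, which is one possible reading of "true > false".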
Initial acquisition versus evaluation
Number of assertions gathered
Knowledge capture at YI forum: 1,000
• Forms
• Tree navigation
• Conference setting
• 2 days
• 65 people
• “I assert that t cell activation is a kind of immune response”
Evaluation conducted via email request: 11,000
• Multiple choice (voting)
• Home setting
• 3 days
• 68 people
• “I agree that t cell activation is a kind of immune response”
iCAPTURer 2 pattern
1. Infer complete ontology
2. Present each edge as a multiple choice question {true, false, I don’t know}
3. Aggregate votes to decide on each triple
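Step 3 could be implemented in many ways; the sketch below shows one simple option, a majority vote that ignores "I don't know" ballots. The `decide` function, the `min_votes` threshold, and the ballots are assumptions made for illustration, not the aggregation method actually used in iCAPTURer 2.

```python
from collections import Counter
from typing import Iterable


def decide(ballots: Iterable[str], min_votes: int = 3) -> str:
    """Return 'accept', 'reject', or 'undecided' for one candidate triple.

    Ballots other than 'true' or 'false' are ignored; min_votes is an
    arbitrary threshold chosen for this sketch.
    """
    counts = Counter(b for b in ballots if b in ("true", "false"))
    informative = counts["true"] + counts["false"]
    if informative < min_votes or counts["true"] == counts["false"]:
        return "undecided"
    return "accept" if counts["true"] > counts["false"] else "reject"


# Hypothetical ballots; the first triple echoes the example used on the slides.
ballots_by_triple = {
    ("t cell activation", "subClassOf", "immune response"):
        ["true", "true", "false", "I don't know", "true"],
    ("immune response", "subClassOf", "t cell activation"):
        ["false", "false", "true", "false"],
    ("apoptosis", "subClassOf", "immune response"):
        ["I don't know", "true"],
}

for triple, ballots in ballots_by_triple.items():
    print(triple, "->", decide(ballots))
```

A plain majority does nothing to correct for voters who lean towards one answer, so a real system would likely need per-voter weighting or calibration on top of this.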
iCAPTURer 2 knowledge sought
X subClassOf Y ?
(immunology)
iCAPTURer2 results
• Same pattern of participation
• Only 66% correct overall in assessing subClass assertions
• highly biased towards saying ‘yes’.
[Figure: fraction of subclass judgments made, per volunteer; x-axis: volunteer (1 to 25); y-axis: fraction (0 to 1.2)]
iCAPTURer summary
• Scientifically relevant tasks are harder: the pool of potential contributors is smaller but, in my experience, generally very willing.
• Engaging the competitive instinct was helpful in obtaining the responses we did.
• Much room for further investigation.
Small steps
• but apparently in a promising direction
Filling in Freebase with Typewriter
http://typewriter.freebaseapps.com/ March 9, 2009
X is a Y ?
To achieve mass collaborative bioLogical knowledge assembly, make it possible for people to contribute in multiple modes
- as creators
- as evaluators
- as system builders (open APIs are crucial)
and for multiple reasons
- personal information management
- fun, competition
- finance
“...how you envision future developments...”
Automation + Human computation = increasingly high-throughput bioLogical knowledge representation
“...how your own expertise would fit into this realm...”
• more bioLogical analyses requires: knowledge representation, machine learning, community action
• ben knows a bit about: knowledge representation, machine learning, community action
http://biordf.net/~bgood/
Thanks to
• developers: Eddie Kawas, Paul Lu
• advisor: Mark Wilkinson
• Barend Mons for the invitation and Marco Roos for the accommodation!