Improving DBpedia (one microtask at a time)
Elena Simperl
University of Southampton
Google, San Francisco, 21 April 2015
DBpedia
Class                Instances
Resource (overall)   4,233,000
Place                  735,000
Person               1,450,000
Work                   411,000
Species                251,000
Organisation           241,000
4.58M things
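For reference, class counts like those above can be re-queried against the public DBpedia SPARQL endpoint. A minimal sketch in Python using the SPARQLWrapper library; the endpoint URL and the class list are assumptions based on the standard public setup, and live counts will differ from the snapshot in the table:

```python
# Sketch: re-query instance counts per DBpedia ontology class.
# Assumes the public endpoint is reachable; live counts will differ
# from the snapshot in the table above.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def count_instances(class_name):
    """Count entities typed with dbo:<class_name>."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT (COUNT(?e) AS ?n) WHERE {{ ?e a dbo:{class_name} }}
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

for cls in ["Place", "Person", "Work", "Species", "Organisation"]:
    print(cls, count_instances(cls))
```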
Crowds or no crowds?
• Study different ways to crowdsource entity typing using paid microtasks.
• Three workflows
– Free associations
– Validating the machine
– Exploring the DBpedia ontology
What to crowdsource
• Entity typing (free associations)
[Diagram: a crowd worker freely associates an entity E with a class C]
What to crowdsource (2)
• Entity typing (from a list of suggestions)
[Diagram: entity E with suggested classes City, SportsTeam, Municipality, PopulatedPlace; the worker picks the best class C]
How to crowdsource: no suggestions
Workflow
• Ask crowd to suggest classes
• Take top k suggestions
• Ask crowd to vote for the best match
(a minimal sketch of this two-step aggregation follows this slide)
Pros/cons
+ No biases
+ No pre-processing
– Vocabulary convergence
– Time and costs
– Needs many classifications (the more, the better)
– Two steps
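A minimal sketch of this two-step workflow, assuming simple frequency counting for step one and a majority vote for step two; the function names, the normalisation, and the cut-off k are illustrative, not taken from the study:

```python
# Sketch: two-step workflow without suggestions.
# Step 1: collect free-text class suggestions, keep the top k.
# Step 2: a second crowd votes on the shortlist; majority wins.
from collections import Counter

def top_k_suggestions(suggestions, k=5):
    """Aggregate free-text suggestions; light normalisation helps
    'city' and 'City' converge on the same candidate."""
    counts = Counter(s.strip().lower() for s in suggestions)
    return [label for label, _ in counts.most_common(k)]

def best_match(votes):
    """Majority vote over the shortlist."""
    return Counter(votes).most_common(1)[0][0]

# Illustrative data for one entity (not from the study):
step1 = top_k_suggestions(["City", "city", "Town", "Municipality", "Place", "City"], k=3)
print(step1)                                         # ['city', 'town', 'municipality']
print(best_match(["city", "city", "municipality"]))  # 'city'
```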
How to crowdsource: with suggestions
Two options
• Generate a shortlist
– Automatically (a toy sketch follows this slide)
• Show all available options
– As a tree
Pros/cons
+ Focused, cheap, fast
– Too many classes (685!), see [Miller, 1956]
– Not the right classes
– Tool does not perform well
– Crowd is not familiar with the classes, see [Rosch et al., 1976], [Tanaka & Taylor, 1991]
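One way the shortlist could be generated automatically is by matching a candidate string against the ontology's class labels. A toy sketch using plain string similarity; the study used a dedicated classification tool, so this heuristic, the abbreviated class list, and the cutoff are assumptions for illustration only:

```python
# Toy sketch: build a shortlist by string similarity against the
# ontology's class labels. A small stand-in for the 685 real classes.
import difflib

ONTOLOGY_CLASSES = ["City", "SportsTeam", "Municipality",
                    "PopulatedPlace", "Person", "Organisation"]

def shortlist(candidate, n=4):
    """Return up to n class labels resembling the candidate string."""
    return difflib.get_close_matches(candidate, ONTOLOGY_CLASSES, n=n, cutoff=0.3)

print(shortlist("Municipal"))  # e.g. ['Municipality']
```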
How to crowdsource: microtasks
How to crowdsource: microtasks (2)
Experiments: Data
• E1: Baseline, 120 entities
– Classified entities in popular categories
– Test workflows, compare crowd and machine performance
• E2: Unclassified entities, 120 entities
– Test the three workflows on data that cannot be classified automatically
• E3: Unclassified entities, optimized, 120 entities
– Fewer judgements
– Lower level of tool support
Experiments: Methods
• Adjusted precision metric that accounts for broader and narrower matches as well as synonyms (a hedged sketch follows this slide)
• Gold standard (for E2 and E3)
– Two annotators, Cohen's kappa of 0.7
– Conflicts resolved via a small set of rules and discussion
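The slides do not give the formula, so the following is only a plausible reading of such an adjusted precision: full credit for exact and synonym matches, partial credit for broader or narrower ones. The two predicates and the 0.5 weight are assumptions, not the paper's definition:

```python
# Sketch: precision adjusted for broader/narrower matches and synonyms.
# The 0.5 partial weight and the two predicates are assumptions.
def adjusted_precision(answers, gold, is_synonym, is_broader_or_narrower,
                       partial=0.5):
    """answers/gold: class labels per entity, aligned by index.
    Exact and synonym matches score 1.0; broader/narrower score `partial`."""
    score = 0.0
    for a, g in zip(answers, gold):
        if a == g or is_synonym(a, g):
            score += 1.0
        elif is_broader_or_narrower(a, g):
            score += partial
    return score / len(answers) if answers else 0.0
```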
Overall results
• Shortlists are easy & fast
• Freedom comes with a price
• Working at the basic level of abstraction achieves the greatest precision
– Even when there is too much choice
Other observations
• Unclassified entities might be unclassifiable
– Try a different entity summary
– Try a free-text or explorative workflow
• Popular classes are not enough
– An alternative approach to browse the taxonomy is needed
• The basic level of abstraction in DBpedia is user-friendly
– But when given the freedom to choose, users suggest more specific classes
– Domain-specific vocabulary is not welcome
Conclusions
• In knowledge engineering, microtask crowdsourcing has focused on improving the results of automatic algorithms
• We know too little about those cases in which algorithms fail
• No optimal workflow in sight
• The DBpedia ontology needs revision
E. Simperl, Q. Bu, Y. Li: "Using microtasks to crowdsource DBpedia entity classification: a study in workflow design". Submitted to the Semantic Web Journal (SWJ), 2015.
Email: [email protected]
Twitter: @esimperl