Improving DBpedia (one microtask at a time)
Elena Simperl
University of Southampton
Google, San Francisco, 21 April 2015
DBpedia
Class                Instances
Resource (overall)   4,233,000
Place                  735,000
Person               1,450,000
Work                   411,000
Species                251,000
Organisation           241,000
4.58M things
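For reference, class counts like those above can be re-queried against the public DBpedia SPARQL endpoint. A minimal sketch in Python using the SPARQLWrapper library; the endpoint URL and the class list are assumptions based on the standard public setup, and live counts will differ from the snapshot in the table:

```python
# Sketch: re-query instance counts per DBpedia ontology class.
# Assumes the public endpoint is reachable; live counts will differ
# from the snapshot in the table above.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def count_instances(class_name):
    """Count entities typed with dbo:<class_name>."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT (COUNT(?e) AS ?n) WHERE {{ ?e a dbo:{class_name} }}
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

for cls in ["Place", "Person", "Work", "Species", "Organisation"]:
    print(cls, count_instances(cls))
```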
Crowds or no crowds?
• Study different ways to crowdsource entity typing using paid microtasks.
• Three workflows
– Free associations
– Validating the machine
– Exploring the DBpedia ontology
What to crowdsource
• Entity typing (free associations)
[Diagram: a crowd worker freely associates an entity E with a class C]
What to crowdsource (2)
• Entity typing (from a list of suggestions)
[Diagram: entity E with suggested classes City, SportsTeam, Municipality, PopulatedPlace; the worker picks the best class C]
How to crowdsource: no suggestions
Workflow
• Ask crowd to suggest classes
• Take top k suggestions
• Ask crowd to vote for the best match
(a minimal sketch of this two-step aggregation follows this slide)
Pros/cons
+ No biases
+ No pre-processing
– Vocabulary convergence
– Time and costs
– Needs many classifications (the more, the better)
– Two steps
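A minimal sketch of this two-step workflow, assuming simple frequency counting for step one and a majority vote for step two; the function names, the normalisation, and the cut-off k are illustrative, not taken from the study:

```python
# Sketch: two-step workflow without suggestions.
# Step 1: collect free-text class suggestions, keep the top k.
# Step 2: a second crowd votes on the shortlist; majority wins.
from collections import Counter

def top_k_suggestions(suggestions, k=5):
    """Aggregate free-text suggestions; light normalisation helps
    'city' and 'City' converge on the same candidate."""
    counts = Counter(s.strip().lower() for s in suggestions)
    return [label for label, _ in counts.most_common(k)]

def best_match(votes):
    """Majority vote over the shortlist."""
    return Counter(votes).most_common(1)[0][0]

# Illustrative data for one entity (not from the study):
step1 = top_k_suggestions(["City", "city", "Town", "Municipality", "Place", "City"], k=3)
print(step1)                                         # ['city', 'town', 'municipality']
print(best_match(["city", "city", "municipality"]))  # 'city'
```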
How to crowdsource: with suggestions
Two options
• Generate a shortlist
– Automatically (a toy sketch follows this slide)
• Show all available options
– As a tree
Pros/cons
+ Focused, cheap, fast
– Too many classes (685!), see [Miller, 1956]
– Not the right classes
– Tool does not perform well
– Crowd is not familiar with the classes, see [Rosch et al., 1976], [Tanaka & Taylor, 1991]
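One way the shortlist could be generated automatically is by matching a candidate string against the ontology's class labels. A toy sketch using plain string similarity; the study used a dedicated classification tool, so this heuristic, the abbreviated class list, and the cutoff are assumptions for illustration only:

```python
# Toy sketch: build a shortlist by string similarity against the
# ontology's class labels. A small stand-in for the 685 real classes.
import difflib

ONTOLOGY_CLASSES = ["City", "SportsTeam", "Municipality",
                    "PopulatedPlace", "Person", "Organisation"]

def shortlist(candidate, n=4):
    """Return up to n class labels resembling the candidate string."""
    return difflib.get_close_matches(candidate, ONTOLOGY_CLASSES, n=n, cutoff=0.3)

print(shortlist("Municipal"))  # e.g. ['Municipality']
```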
How to crowdsource: microtasks
How to crowdsource: microtasks (2)
Experiments: Data
• E1: Baseline, 120 entities
– Classified entities in popular categories
– Test workflows, compare crowd and machine performance
• E2: Unclassified entities, 120 entities
– Test the three workflows on data that cannot be classified automatically
• E3: Unclassified entities, optimized, 120 entities
– Fewer judgements
– Lower level of tool support
Experiments: Methods
• Adjusted precision metric that accounts for broader and narrower matches as well as synonyms (a hedged sketch follows this slide)
• Gold standard (for E2 and E3)
– Two annotators, Cohen's kappa of 0.7
– Conflicts resolved via a small set of rules and discussion
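The slides do not give the formula, so the following is only a plausible reading of such an adjusted precision: full credit for exact and synonym matches, partial credit for broader or narrower ones. The two predicates and the 0.5 weight are assumptions, not the paper's definition:

```python
# Sketch: precision adjusted for broader/narrower matches and synonyms.
# The 0.5 partial weight and the two predicates are assumptions.
def adjusted_precision(answers, gold, is_synonym, is_broader_or_narrower,
                       partial=0.5):
    """answers/gold: class labels per entity, aligned by index.
    Exact and synonym matches score 1.0; broader/narrower score `partial`."""
    score = 0.0
    for a, g in zip(answers, gold):
        if a == g or is_synonym(a, g):
            score += 1.0
        elif is_broader_or_narrower(a, g):
            score += partial
    return score / len(answers) if answers else 0.0
```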
Overall results
• Shortlists are easy & fast
• Freedom comes with a price
• Working at the basic level of abstraction achieves the greatest precision
– Even when there is too much choice
Other observations
• Unclassified entities might be unclassifiable
– Try a different entity summary
– Try a free-text or explorative workflow
• Popular classes are not enough
– An alternative approach to browse the taxonomy is needed
• The basic level of abstraction in DBpedia is user-friendly
– But when given the freedom to choose, users suggest more specific classes
– Domain-specific vocabulary is not welcome
Conclusions
• In knowledge engineering, microtask crowdsourcing has focused on improving the results of automatic algorithms
• We know too little about those cases in which algorithms fail
• No optimal workflow in sight
• The DBpedia ontology needs revision
E. Simperl, Q. Bu, Y. Li: "Using microtasks to crowdsource DBpedia entity classification: a study in workflow design". Submitted to the Semantic Web Journal (SWJ), 2015.
Email: [email protected]
Twitter: @esimperl