
Methodology and Campaign Design for the Evaluation of Semantic Search Tools

Stuart N. Wrigley¹, Dorothee Reinhard², Khadija Elbedweihy¹, Abraham Bernstein², Fabio Ciravegna¹

¹University of Sheffield, UK
²University of Zurich, Switzerland

23.04.2010


Outline

• SEALS initiative

• Evaluation design

– Criteria

– Two phase approach

– API

– Workflow

• Data

• Results and Analyses

• Conclusions


SEALS INITIATIVE


SEALS goals

• Develop and diffuse best practices in evaluation of semantic technologies

• Create a lasting reference infrastructure for semantic technology evaluation

– This infrastructure will be the SEALS Platform

• Facilitate the continuous evaluation of semantic technologies

• Organise two worldwide Evaluation Campaigns

– One this summer

– Next in late 2011 / early 2012

• Allow easy access to both:

– evaluation results (for developers and researchers)

– technology roadmaps (for non-technical adopters)

• Transfer all infrastructure to the community


Targeted technologies

Five different types of semantic technologies:

• Ontology Engineering tools

• Ontology Storage and Reasoning Systems

• Ontology Matching tools

• Semantic Search tools

• Semantic Web Service tools


What’s our general approach?

• Low overhead to the participant

– Automate as far as possible

– We provide the compute

– We initiate the actual evaluation run

– We perform the analysis

• Encourage participation in evaluation campaign definitions and design

• Provide infrastructure for more than simply running high profile evaluation campaigns

– reuse existing evaluations for your personal testing

– create new evaluations

– store / publish / download test data sets

• Open Source (Apache 2.0)


SEALS Platform


[Architecture diagram] The SEALS Service Manager sits at the centre of the platform and coordinates five services: the Result Repository Service, the Tool Repository Service, the Test Data Repository Service, the Evaluation Repository Service and the Runtime Evaluation Service. The platform is used by Technology Developers, Evaluation Organisers and Technology Users.


SEARCH EVALUATION DESIGN


What do we want to do?

• Evaluate / benchmark semantic search tools

– with respect to their semantic peers

• Allow as wide a range of interface styles as possible

• Assess tools on basis of a number of criteria including usability

• Automate (part of) it


Evaluation criteria

User-centred search methodologies will be evaluated according to the following criteria:

• Query expressiveness
– Is the style of interface suited to the type of query?
– How complex can the queries be?

• Usability (effectiveness, efficiency, satisfaction)
– How easy is the tool to use?
– How easy is it to formulate the queries?
– How easy is it to work with the answers?

• Scalability
– Ability to cope with a large ontology
– Ability to query a large repository in a reasonable time
– Ability to cope with a large number of results returned

• Quality of documentation
– Is it easy to understand?
– Is it well structured?

• Performance (resource consumption)
– execution time (speed)
– CPU load
– memory required


Two phase approach

• Semantic search tool evaluation demands a user-in-the-loop phase

– usability criterion

• Two phases:

– User-in-the-loop

– Automated


Evaluation criteria

Each phase will address a different subset of criteria.

• Automated evaluation: query expressiveness, scalability, performance, quality of documentation

• User-in-the-loop: usability, query expressiveness


RUNNING THE EVALUATION


Automated evaluation


• Tools uploaded to platform. Includes:

– wrapper implementing API

– supporting libraries

• Test data and questions stored on platform

• Workflow specifies details of evaluation sequence

• Evaluation executed offline in batch mode

• Results stored on platform

• Analyses performed and stored on platform

[Diagram] The search tool, wrapped by the API, runs inside the Runtime Evaluation Service on the SEALS Platform.


User in the loop evaluation


• Performed at tool provider site

• All materials provided

– Controller software

– Instructions (leader and subjects)

– Questionnaires

• Data downloaded from platform

• Results uploaded to platform

[Diagram] On the tool provider's machine, the Controller drives the search tool through the API; the Controller exchanges test data and results with the SEALS Platform over the web.


API

• A range of information needs to be acquired from the tool in both phases

• In the automated phase, the tool has to be executed and interrogated with no human assistance.

• The interface between the SEALS platform and the tool must therefore be formalised


API – common

• Load ontology

– success / failure informs the interoperability analysis

• Determine result type

– ranked list or set?

• Results ready?

– used to determine execution time

• Get results

– list of URIs

– number of results to be determined by developer
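
A minimal sketch of what such a common wrapper interface could look like in Java; the interface and method names below are illustrative assumptions, not the actual SEALS API.

```java
// Hypothetical common wrapper interface; names and signatures are
// illustrative assumptions, not the actual SEALS API.
import java.util.List;

public interface SearchToolWrapper {

    /** How the tool presents its results. */
    enum ResultType { RANKED_LIST, SET }

    /** Load the ontology; returning false signals an interoperability failure. */
    boolean loadOntology(String ontologyLocation);

    /** Report whether results are a ranked list or an unordered set. */
    ResultType getResultType();

    /** Poll whether results of the last query are ready (used to measure execution time). */
    boolean resultsReady();

    /** Return the results as a list of URIs; how many is up to the tool developer. */
    List<String> getResults();
}
```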


API – user in the loop

• User query input complete?

– used to determine input time

• Get user query

– String representation of the user's query

– for an NL interface, the same as the text entered

• Get internal query

– String representation of the internal query

– for use with…
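
A sketch of how these user-in-the-loop calls might extend the common wrapper interface sketched above; again, the method names are illustrative assumptions rather than the actual SEALS API.

```java
// Hypothetical user-in-the-loop extension of the common wrapper sketched
// earlier; method names are illustrative assumptions, not the SEALS API.
public interface UserDrivenSearchToolWrapper extends SearchToolWrapper {

    /** Poll whether the user has finished entering the query (used to measure input time). */
    boolean isUserQueryComplete();

    /** The query as the user entered it (for an NL interface, the text itself). */
    String getUserQuery();

    /** The tool's internal representation of the query (e.g. SPARQL), as a String. */
    String getInternalQuery();
}
```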


API – automated

• Execute query

– must not constrain the tool to a particular query format

– the tool provider is given the questions shortly before the evaluation is executed

– the tool provider converts those questions into some form of ‘internal representation’ which can be serialised as a String

– the serialised internal representation is passed to this method
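
As a rough illustration, the automated phase could add a single execution method to the wrapper, and the platform could drive it in batch mode along the following lines; all class and method names here are assumptions for illustration, not the actual SEALS implementation.

```java
// Hypothetical automated-phase extension of the common wrapper sketched earlier,
// plus a sketch of how the platform might drive it in batch mode.
import java.util.List;

interface AutomatedSearchToolWrapper extends SearchToolWrapper {

    /** Run one query, supplied as the tool's own serialised internal representation. */
    void executeQuery(String serialisedInternalQuery);
}

final class BatchRunnerSketch {

    /** Load the ontology once, then execute and time each serialised query. */
    static void run(AutomatedSearchToolWrapper tool, String ontology, List<String> queries) {
        if (!tool.loadOntology(ontology)) {
            return;                                     // load failure is itself a recorded result
        }
        for (String query : queries) {
            long start = System.nanoTime();
            tool.executeQuery(query);
            while (!tool.resultsReady()) {
                Thread.onSpinWait();                    // poll until the tool reports results
            }
            long elapsedNs = System.nanoTime() - start; // execution time for this query
            List<String> uris = tool.getResults();      // returned result URIs
            System.out.println(query + " -> " + uris.size() + " results in " + elapsedNs + " ns");
        }
    }
}
```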


DATA


Data set – user in the loop

• Mooney Natural Language Learning Data

– used by a previous semantic search evaluation

– simple and well-known domain

– using the geography subset:

• 9 classes

• 11 datatype properties

• 17 object properties

• 697 instances

– 877 questions already available


Data set – automated

• EvoOnt

– set of object-oriented software source code ontologies

– easy to create different ABox sizes given a TBox

– 5 data set sizes: 1k, 10k, 100k, 1M, 10M triples

– questions generated by software engineers


RESULTS AND ANALYSES


Questionnaires

3 questionnaires:

• SUS questionnaire

• Extended questionnaire

– similar to SUS in terms of type of question but more detailed

• Demographics questionnaire


System Usability Scale (SUS) score

• SUS is a Likert scale

• 10-item questionnaire

• Each question has 5 levels (strongly disagree to strongly agree)

• SUS scores have a range of 0 to 100.

• A score of around 60 or above is generally considered an indicator of good usability.
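
For concreteness, the standard SUS scoring rule maps the ten 1–5 responses to a 0–100 score as in the sketch below; this helper is only an illustration, not part of the SEALS tooling.

```java
// Standard SUS scoring: odd-numbered items contribute (response - 1), even-numbered
// items contribute (5 - response); the sum is multiplied by 2.5 to give 0-100.
public final class SusScoring {

    /** responses: the ten answers in questionnaire order, each between 1 and 5. */
    public static double score(int[] responses) {
        double sum = 0;
        for (int i = 0; i < responses.length; i++) {
            int r = responses[i];
            sum += (i % 2 == 0) ? (r - 1) : (5 - r);  // index 0,2,4,... are items 1,3,5,...
        }
        return sum * 2.5;                             // range 0..100
    }
}
```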


Demographics

• Age

• Gender

• Profession

• Number of years in education

• Highest qualification

• Number of years in employment

• Knowledge of informatics

• Knowledge of linguistics

• Knowledge of formal query languages

• Knowledge of English

• …


Automated

Results

• Execution success (OK / FAIL / PLATFORM ERROR)

• Triples returned

• Time to execute each query

• CPU load, memory usage

Analyses

• Ability to load ontology and query (interoperability)

• Precision and Recall (search accuracy and query expressiveness)

• Tool robustness: ratio of all benchmarks executed to number of failed executions
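
Precision and recall here follow the standard set-based definitions over the returned and expected result URIs; the helper below is an illustrative sketch, not the SEALS analysis code.

```java
// Illustrative set-based precision and recall over result URIs;
// not the SEALS analysis code.
import java.util.HashSet;
import java.util.Set;

final class AccuracyMetrics {

    /** Fraction of returned URIs that are in the expected (gold standard) set. */
    static double precision(Set<String> returned, Set<String> expected) {
        if (returned.isEmpty()) return 0.0;
        Set<String> hits = new HashSet<>(returned);
        hits.retainAll(expected);                     // returned ∩ expected
        return (double) hits.size() / returned.size();
    }

    /** Fraction of expected URIs that the tool actually returned. */
    static double recall(Set<String> returned, Set<String> expected) {
        if (expected.isEmpty()) return 0.0;
        Set<String> hits = new HashSet<>(returned);
        hits.retainAll(expected);
        return (double) hits.size() / expected.size();
    }
}
```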


User in the loop

Results (in addition to the core results, which are similar to the automated phase)

• Query captured by the tool

• Underlying query (e.g., SPARQL)

• Is answer in result set? (user may try a number of queries before being successful)

• time required to obtain answer

• number of queries required to answer question

Analyses

• Precision and Recall

• Correlations between results and SUS scores, demographics, etc
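
The correlation analyses could use the standard Pearson coefficient over paired per-subject values (e.g. task success rate vs. SUS score); the sketch below is illustrative only, not the SEALS analysis code.

```java
// Illustrative Pearson correlation between two paired series of equal length;
// not the SEALS analysis code.
final class Correlation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;                             // assumes x.length == y.length
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];  sy += y[i];
            sxx += x[i] * x[i];  syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov  = sxy - sx * sy / n;              // Σ(x - mean(x))(y - mean(y))
        double varX = sxx - sx * sx / n;              // Σ(x - mean(x))²
        double varY = syy - sy * sy / n;              // Σ(y - mean(y))²
        return cov / Math.sqrt(varX * varY);
    }
}
```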


Dissemination

• Results browsable on the SEALS portal

• Split into three areas:

– performance

– usability

– comparison between tools


CONCLUSIONS


Conclusions

• Methodology and design of a semantic search tool evaluation campaign

• Exists within the wider context of the SEALS initiative

• This is the first version of the methodology

– feedback from participants and the community will drive the design of the second campaign

• Emphasis on the user experience (for search)

– Two phase approach


Get involved!

• First Evaluation Campaign in all SEALS technology areas this Summer

• Get involved – your input and participation are crucial

• Workshop planned for ISWC 2010 after campaign

• Find out more (and take part!) at:

http://www.seals-project.eu

or talk to me, or email me ([email protected])


Timeline

• May 2010: Registration opens

• May-June 2010: Evaluation materials and documentation are provided to participants

• July 2010: Participants upload their tools

• August 2010: Evaluation scenarios are executed

• September 2010: Evaluation results are analysed

• November 2010: Evaluation results are discussed at ISWC 2010 workshop (tbc)


Best paper award

SEALS is proud to be sponsoring the best paper award here at SemSearch2010

Congratulations to the winning authors!
