Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association

INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)

www.kit.edu

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Daniel M. Herzig, Thanh Tran

WWW 2012

Agenda

Motivation

Problem Definition

Existing Solutions

Our Approach

Entity Relevance Model (ERM)

Ranking

On-The-Fly alignment

Experiments

Conclusion

Daniel M. Herzig - Institute AIFB

Company running a movie shopping website

[Figure: the movie shopping website, backed by the company's dataset Ds]

Users search the website via forms. Each search request is executed internally as a structured query.

Screenshot of http://www.imdb.com/search/title

[Figure: structured query qs (e.g. SQL, SPARQL) over Ds. Example in the IMDb vocabulary (prefix i:): ?x type i:movie; ?x i:directors "Steven Spielberg"; ?x i:year 1982]

Company discovers the plethora of Linked Data available on the Web and identifies Data Sources beneficial for its business

[Figure: Linked Data on the Web, http://richard.cyganiak.de/2007/10/lod/, next to the company's dataset Ds and query qs]

Zero Star Mugs!

vs.

Problems of Data Integration arise…

qs does not return results

No links, no integration

No knowledge about the external data schema

External data might change often

Problem Definition

Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds.

[Figure: the structured entity query qs is posed against the source dataset Ds; which entities in the target datasets Dt1 and Dt2 are relevant?]

Problem Setting

Data model is a labeled directed graph

Directly related to RDF

RDF specifics, e.g. blank nodes, are omitted

Entity query: SPARQL BGP query with one select variable

Entity queries are the most frequent type of web search queries (Pound et al., WWW 2010)

Web Data scenario: data exhibits heterogeneity on the schema level and the data level
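
The setting above can be illustrated with a minimal sketch. It assumes the graph is held as a plain list of (subject, predicate, object) triples and that the entity query is star-shaped, i.e. every pattern has the single select variable ?x as subject. The toy data and the name `evaluate_entity_query` are illustrative and not taken from the talk.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy source dataset Ds in an IMDb-style vocabulary (illustrative data only).
DS: List[Triple] = [
    ("m1", "type", "i:movie"),
    ("m1", "i:directors", "Steven Spielberg"),
    ("m1", "i:year", "1982"),
    ("m2", "type", "i:movie"),
    ("m2", "i:directors", "Steven Spielberg"),
    ("m2", "i:year", "1993"),
]


def evaluate_entity_query(graph: List[Triple],
                          patterns: List[Tuple[str, str]]) -> Set[str]:
    """Return all subjects that satisfy every (predicate, object) pattern.

    Each pattern stands for the triple pattern (?x, predicate, object),
    so the whole list is a star-shaped BGP with one select variable ?x.
    """
    result = None
    for predicate, obj in patterns:
        matches = {s for (s, p, o) in graph if p == predicate and o == obj}
        result = matches if result is None else result & matches
    return result if result is not None else set()


# qs: movies directed by Steven Spielberg in 1982.
qs = [("type", "i:movie"),
      ("i:directors", "Steven Spielberg"),
      ("i:year", "1982")]
print(evaluate_entity_query(DS, qs))
```

Running the same patterns against a dataset that uses another vocabulary (for example a:Directors instead of i:directors) returns the empty set, which is the integration problem the talk addresses.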

Heterogeneous Web Data

Schema-level: actors vs. starring

Data-level: Steven Spielberg vs. Spielberg, Steven

Varying number of attributes per entity

[Figure: three example entities, one per dataset.
Amazon (a:), entity ea: type a:Movie; a:Directors "Steven Spielberg"; a:Title "Munich"; a:Actors "Daniel Craig, Eric Bana"; a:Binding "DVD"; a:ReleaseDate 2005.
IMDb (i:), entity ei: type i:movie; i:directors "Spielberg, Steven (I)"; i:title "E.T. (1994)"; i:actors "Coyote, Peter"; i:producer "Spielberg, Steven (I)".
DBpedia (db:), entity ed: type db:Film; db:director db:Steven_Spielberg; rdfs:label "1941 (film)"; db:starring db:John_Candy_(actor).]

Aim: Integrate External Data into the Search Process

[Figure: the query qs over Ds should also reach the target datasets Dt]

Keyword search

Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting based on up-front data integration

Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Existing Strategies – Keyword Search

Transform qs into keyword query

Match against bag-of-words representation of entities

Bridges schema differences by neglecting the structure

Baseline 1 (KW), IR baseline using Semplore (Lucene)
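
The keyword strategy can be sketched as follows. This is a simplified illustration, not the Semplore/Lucene baseline itself: it assumes a naive tokenizer (strip the vocabulary prefix, split camel case, lowercase) and scores by plain term overlap instead of a real IR ranking function. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Strip a leading vocabulary prefix, split camel case, lowercase."""
    text = re.sub(r"^[a-z]+:", "", text)               # "a:Directors" -> "Directors"
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)   # "ReleaseDate" -> "Release Date"
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(pairs):
    """Flatten (attribute, value) pairs into a multiset of terms, ignoring structure."""
    bag = Counter()
    for attribute, value in pairs:
        bag.update(tokenize(attribute))
        bag.update(tokenize(value))
    return bag


def overlap_score(query_bag, entity_bag):
    """Count query term occurrences that also occur in the entity (toy scoring)."""
    return sum(min(count, entity_bag[term]) for term, count in query_bag.items())


# Amazon-vocabulary query and two IMDb entities, as in the slide's example.
query = [("type", "a:Movie"),
         ("a:Directors", "Rainer Werner Fassbinder"),
         ("a:TheatricalReleaseDate", "1982")]
e1 = [("i:title", "Veronika Voss"),
      ("i:director", "Rainer Werner Fassbinder"),
      ("i:released", "1982")]
e2 = [("type", "i:movie"),
      ("i:title", "Schindlers Liste (1994)"),
      ("i:director", "Spielberg, Steven (I)")]

q_bag = bag_of_words(query)
for name, entity in [("e1", e1), ("e2", e2)]:
    print(name, overlap_score(q_bag, bag_of_words(entity)))
```

Note that e2 still receives a non-zero score from the generic terms "type" and "movie", which shows how neglecting the structure bridges schema differences at the cost of precision.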

[Figure: keyword search example in three steps.
(1) The Amazon (a:) query ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982 becomes the keyword query "directors rainer werner fassbinder theatrical release date 1982 type movie".
(2) IMDb (i:) entities are represented as bags of words. e1 (i:title "Veronika Voss"; i:director "Rainer Werner Fassbinder"; i:released 1982): "title veronika voss director rainer werner fassbinder released 1982". e2 (type i:movie; i:title "Schindlers Liste (1994)"; i:director "Spielberg, Steven (I)"): "title schindlers liste 1994 director spielberg steven i type movie".
(3) Result list: e1, e2.]

Existing Strategies – Query Rewriting

Create mappings using ontology alignment tools (Falcon AO)

Rewrite query using the mappings, omit missing mappings, replace constants with variables

Reduces the search space; keyword search is performed on top

Baseline 2 (QR), database-style baseline
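
The rewriting step can be sketched as below, assuming the mappings are already available as a dictionary from source predicates to target predicates. The function name, the fresh-variable naming scheme (?v0, ?v1, ...) and the identity mapping for "type" are assumptions made for this illustration; the talk obtains its mappings from Falcon AO.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]  # (subject, predicate, object)


def rewrite_query(patterns: List[Pattern],
                  mappings: Dict[str, str]) -> List[Pattern]:
    """Rewrite source-vocabulary patterns into the target vocabulary.

    Patterns whose predicate has no mapping are omitted, and constants in
    object position are replaced with fresh variables.
    """
    rewritten = []
    for subject, predicate, obj in patterns:
        if predicate not in mappings:
            continue  # omit missing mappings
        if not obj.startswith("?"):
            obj = f"?v{len(rewritten)}"  # replace constant with a variable
        rewritten.append((subject, mappings[predicate], obj))
    return rewritten


# Hypothetical alignment output; a:TheatricalReleaseDate has no mapping.
mappings = {"type": "type", "a:Directors": "db:director", "a:Title": "db:name"}

qs = [("?x", "type", "a:Movie"),
      ("?x", "a:Directors", "Rainer Werner Fassbinder"),
      ("?x", "a:TheatricalReleaseDate", "1982")]

for pattern in rewrite_query(qs, mappings):
    print(pattern)
```

The rewritten query only constrains which attributes an entity must have, so it narrows the search space; the dropped constants are then matched by the keyword search performed on top.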

[Figure: query rewriting example. An ontology alignment tool takes the Amazon schema and the DBpedia schema and produces mappings such as a:Directors = db:director, a:Title = db:name, a:Actor = db:starring, … = …. The Amazon (a:) query ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982 is rewritten into the DBpedia (db:) query ?x db:director ?y; ?x type ?z.]

Contributions

(1) Novel approach for querying heterogeneous Web data sources

No upfront data integration necessary

Uses an Entity Relevance Model (ERM) for ranking and for computing mappings on the fly

(2) Implementation of the approach

Construction of an ERM and its use for alignment and ranking

Best-effort algorithm for creating mappings at runtime

(3) Large-scale evaluation with 3 real-world datasets

Experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.

Overview of our Approach

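[Figure: overview. qs is run against Ds to obtain the results Rs, which serve as relevance feedback for the Entity Relevance Model, a model leveraging the structure of the data. A keyword query is run against the target datasets Dt (keyword search to cross vocabulary mismatches) to obtain candidate entities et, which are then matched and ranked against the model.]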

Entity Relevance Model (ERM)

Based on the Structured Relevance Model (Lavrenko et al. 2007)

Entity Relevance Model:

Query specific model

Captures structure and content of relevant results

Composite model consisting of language models weighted by occurrence.

Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)

ERM (2)

[Figure: from qs to the results Rs = {e1, e2} to the ERM.
e1: type Film; director "Rainer Werner Fassbinder"; label "World on Wires"; released 1973; starring "Klaus Löwitsch".
e2: type Film; director "Rainer Werner Fassbinder"; label "Veronika Voss"; released 1982; language "German"; starring "Barbara Valentin".]
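
A minimal sketch of how such a composite model could be built from the two results e1 and e2 of this example. It assumes one maximum-likelihood term distribution per attribute and an occurrence weight equal to the fraction of results that carry the attribute; the exact estimation used by the authors may differ, and the name `build_erm` is hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(value):
    """Whitespace tokenizer with lowercasing (assumed for this sketch)."""
    return value.lower().split()


def build_erm(results):
    """Return {attribute: (occurrence weight, {term: probability})}."""
    term_counts = defaultdict(Counter)
    entity_counts = Counter()
    for entity in results:
        for attribute, values in entity.items():
            entity_counts[attribute] += 1
            for value in values:
                term_counts[attribute].update(tokenize(value))
    erm = {}
    for attribute, counts in term_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue  # attribute without any terms: no language model
        language_model = {term: n / total for term, n in counts.items()}
        weight = entity_counts[attribute] / len(results)
        erm[attribute] = (weight, language_model)
    return erm


e1 = {"type": ["Film"], "director": ["Rainer Werner Fassbinder"],
      "label": ["World on Wires"], "released": ["1973"],
      "starring": ["Klaus Löwitsch"]}
e2 = {"type": ["Film"], "director": ["Rainer Werner Fassbinder"],
      "label": ["Veronika Voss"], "released": ["1982"],
      "language": ["German"], "starring": ["Barbara Valentin"]}

erm = build_erm([e1, e2])
for attribute, (weight, language_model) in erm.items():
    print(attribute, weight, language_model)
```

In this sketch the field "language" gets weight 0.5 because only e2 has it, while "director" gets weight 1.0 and a distribution concentrated on the three name terms.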

Modelling Target Entities

Modeled the same way as ERM

Language Model for each attribute

[Figure: IMDb (i:) entity ei, modeled with one language model per attribute: type i:movie; i:directors "Spielberg, Steven (I)"; i:title "E.T. (1994)"; i:actors "Coyote, Peter"; i:producer "Spielberg, Steven (I)"]

Ranking

Idea:

Rank candidate entities according to their similarity to ERM

Note: an alignment between ERM and et is needed

If no mapping is available, use the maximum cross entropy (max H).

[Formula: ranking score, annotated with its three ingredients: boosting of seed query attributes, frequency of attribute as, cross entropy]
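
The ranking formula itself is not reproduced in this transcript, so the following is only a sketch that combines the three annotated ingredients under stated assumptions: the score is a sum over ERM fields of boost × occurrence weight × cross entropy, a lower score means higher similarity, unseen terms get a floor probability `EPSILON`, and "max H" is taken as the cross entropy against that floor. None of these choices is confirmed by the slides.

```python
import math

EPSILON = 1e-6               # assumed floor probability for unseen terms
MAX_H = -math.log(EPSILON)   # assumed "max H" used when no mapping exists


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON))
                for w, pw in p.items())


def score(erm, entity, mapping, seed_attributes, b=2.0):
    """Weighted sum of cross entropies over ERM fields (lower = more similar)."""
    total = 0.0
    for a_s, (weight, lm_s) in erm.items():
        boost = b if a_s in seed_attributes else 1.0  # boost seed query attributes
        a_t = mapping.get(a_s)
        h = cross_entropy(lm_s, entity[a_t]) if a_t in entity else MAX_H
        total += boost * weight * h
    return total


# Toy ERM: {attribute: (occurrence weight, language model)}.
erm = {"director": (1.0, {"fassbinder": 1.0}),
       "released": (1.0, {"1982": 1.0})}
mapping = {"director": "i:director", "released": "i:year"}
seed_attributes = {"director"}

candidates = {
    "e_good": {"i:director": {"fassbinder": 1.0}, "i:year": {"1982": 1.0}},
    "e_bad": {"i:director": {"spielberg": 1.0}, "i:year": {"1982": 1.0}},
}
ranked = sorted(candidates,
                key=lambda name: score(erm, candidates[name], mapping, seed_attributes))
print(ranked)
```

Because the director field is a seed query attribute, a mismatch there is penalized b times as strongly as a mismatch on a non-seed field.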

On The Fly Alignment

as ~ at ??

Idea:

Compare all language models of et to a field of ERM using cross entropy H.

Establish a mapping if the lowest value of H is below a threshold t.

Worst case: n · r comparisons; n and r are usually small

Allows reuse of computed cross entropies for subsequent ranking
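
The alignment idea can be sketched as follows. The smoothing (a floor probability `EPSILON` for unseen terms), the threshold value and the function names are assumptions for illustration only. The function also returns every computed cross entropy so that the values can be reused for the subsequent ranking.

```python
import math

EPSILON = 1e-6  # assumed floor probability for unseen terms


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON))
                for w, pw in p.items())


def align(erm_fields, entity_fields, threshold):
    """Map each ERM field to the entity attribute with the lowest H, if H < threshold.

    Returns (mapping, cache), where cache holds all computed cross entropies.
    """
    mapping, cache = {}, {}
    for a_s, lm_s in erm_fields.items():
        best_attr, best_h = None, math.inf
        for a_t, lm_t in entity_fields.items():
            h = cross_entropy(lm_s, lm_t)
            cache[(a_s, a_t)] = h
            if h < best_h:
                best_attr, best_h = a_t, h
        if best_attr is not None and best_h < threshold:
            mapping[a_s] = best_attr
    return mapping, cache


third = 1 / 3
erm_fields = {
    "director": {"rainer": third, "werner": third, "fassbinder": third},
    "released": {"1982": 1.0},
    "language": {"german": 1.0},
}
entity_fields = {
    "i:directors": {"fassbinder": third, "rainer": third, "werner": third},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}

mapping, cache = align(erm_fields, entity_fields, threshold=2.0)
print(mapping)
print(len(cache), "comparisons")
```

Here the field "language" stays unmapped because none of the entity's language models comes close enough, and the number of comparisons is the product of the two field counts.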

EXPERIMENTS

Datasets

Three real-world, heterogeneous Web datasets:

(1) DBpedia 3.5.1, structured representation of Wikipedia

(2) IMDb, information about movies

(3) Amazon, information about DVD/Videos

(2) and (3) were crawled and transformed to RDF; provided by L3S

Ground Truth

Goal is to find relevant entities in the target datasets

The seed query qs is manually rewritten to obtain the relevant entities in the target datasets.

3 query sets, each with 23 corresponding SPARQL BGP entity queries

[Figure: one seed query, manually rewritten for each vocabulary.
Amazon (a:): ?x type a:Movie; ?x a:Directors "Rainer Werner Fassbinder"; ?x a:TheatricalReleaseDate 1982.
DBpedia (db:): ?x type db:Film; ?x db:director db:Rainer_Werner_Fassbinder; ?x db:released 1982.
IMDb (i:): ?x type i:movie; ?x i:directors "Fassbinder, Rainer Werner"; ?x i:year 1982.]

IR Experiments

Baseline KW – Keyword Search

Baseline QR – Query Rewriting

Three configurations of ERM:

ERM – computes alignments on the fly

ERMa – uses pre-computed alignments only

ERMq – uses pre-computed alignments and creates mappings on top

Six different retrieval settings.

Results (1) – Mean Average Precision

ERM improves over KW by 120% and over QR by 54%

ERMa performs slightly better than ERM

ERMq performs best.
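
For reference, Mean Average Precision can be computed as in the following sketch. The ranked lists and relevance sets are made-up toy data, not results from the experiments.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, relevant)
               for ranked, relevant in runs) / len(runs)


# Two toy queries with their ranked results and ground-truth relevant entities.
runs = [
    (["e1", "e3", "e2"], {"e1", "e2"}),
    (["e4", "e5"], {"e5"}),
]
print(mean_average_precision(runs))
```

If the reported gains are read as relative improvements, an improvement of 120% means a MAP of 2.2 times the baseline value, and 54% means 1.54 times.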

Results (2) – On The Fly Alignment

Pooled mappings for n = 115k entities

Average Precision = 0.7, Average Recall = 0.3 for relevant entities

Pearson correlation ρ(MAP, Precision-Rel) = 0.98

Results (3) – Parameter and Runtime Analysis

Analysis of the model parameters

Sensitivity of retrieval performance in terms of MAP for varying parameter configurations

Runtime analysis

Execution takes less than 13 s on average

Can be improved by moving tasks (e.g. computation of language models) to index time.

Conclusion

Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds.

Entity Relevance Model used for ranking and creating mappings during runtime.

Experiments showed that our approach is effective and exceeds the baselines substantially.

ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).

Daniel M. Herzig, Thanh Tran

herzig@kit.edu

Institute AIFB, Karlsruhe Institute of Technology, Germany

THANK YOU!

Execution Process of our Approach

[Figure: execution process involving qs, Ds, the results Rs, the Entity Relevance Model, the target datasets Dt and the candidate entities et]

Run qs against Ds to obtain results Rs

Build ERM from Rs

Obtain candidate entities et

Compare et to ERM

Rank et according to similarity to ERM

Runtime Analysis

Average execution time less than 13 sec for the parameter setting used in the IR experiments.

Increasing parameter c (i.e. reducing the number of fields of ERM) improves runtime performance

Our implementation performs some tasks at runtime that can be moved to index time

Improvements are easily possible

Parameter Analysis

Model is robust in certain parameter ranges

Boosting b: beneficial for similar datasets, not for diverse ones

Pruning c: small effect on effectiveness, larger effect on efficiency

Boosting Parameter b

If attribute as is present in the seed query, the boosting parameter is set to b, in order to increase its influence during ranking.

Alignment

Compare the language models (probability distributions) of ERM and et by cross entropy

Related Work (excerpt)

Keyword search

Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting

Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Our approach is based on

Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)

Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)
