KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN www.kit.edu Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration Daniel M. Herzig , Thanh Tran WWW2012

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration


Page 1: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft

INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN

www.kit.edu

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Daniel M. Herzig, Thanh Tran

WWW2012

Page 2: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

2 WWW 2012

Agenda

Motivation

Problem Definition

Existing Solutions

Our Approach

Entity Relevance Model (ERM)

Ranking

On-The-Fly alignment

Experiments

Conclusion

Daniel M. Herzig - Institute AIFB

Page 3: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Company running a movie shopping website

[Figure: the movie shopping website, backed by the company's dataset Ds]

Page 4: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Users search the website via forms. Each search request is executed internally as a structured query.

Screenshot of http://www.imdb.com/search/title

[Figure: structured query qs (e.g. SQL, SPARQL) over Ds in the IMDb vocabulary (prefix i:): ?x type i:movie; ?x i:directors "Steven Spielberg"; ?x i:year 1982]

Page 5: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Company discovers the plethora of Linked Data available on the Web and identifies Data Sources beneficial for its business

[Figure: Ds and qs next to the Linked Data cloud. Source: Linked Data on the Web, http://richard.cyganiak.de/2007/10/lod/]

Page 6: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Zero Star Mugs!

vs.

Page 7: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Problems of Data Integration arise…

qs does not return results

No links, no integration

No knowledge about the external data schema

External data might change often


Page 8: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Problem Definition

Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds.

[Figure: a structured entity query qs over the source dataset Ds; which entities in the target datasets Dt1 and Dt2 are relevant?]

Page 9: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Problem Setting

Data model is a labeled directed graph

Directly related to RDF

RDF specifics, e.g. blank nodes, are omitted

Entity query: SPARQL BGP query with one select variable

Entity queries are the most frequent type of web search queries (Pound et al., WWW2010)

Web Data scenario: data exhibits heterogeneity on the schema level and on the data level
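The entity query defined on this slide can be pictured as a set of triple patterns that share one select variable. A minimal Python sketch, assuming a star-shaped query in which ?x is always the subject (a simplification of general BGPs) and a toy in-memory dataset; the identifiers m1 and m2 and the function name are made up for illustration:

```python
Triple = tuple[str, str, str]

# Toy source dataset Ds (hypothetical entities m1 and m2, IMDb-style vocabulary).
DS: list[Triple] = [
    ("m1", "type", "i:movie"),
    ("m1", "i:directors", "Steven Spielberg"),
    ("m1", "i:year", "1982"),
    ("m2", "type", "i:movie"),
    ("m2", "i:directors", "Steven Spielberg"),
    ("m2", "i:year", "1993"),
]

# Entity query qs: each (predicate, object) pattern has the select variable ?x as subject.
QS: list[tuple[str, str]] = [
    ("type", "i:movie"),
    ("i:directors", "Steven Spielberg"),
    ("i:year", "1982"),
]


def evaluate_entity_query(patterns, triples):
    """Return every subject that satisfies all (predicate, object) patterns."""
    facts = set(triples)
    subjects = {s for s, _, _ in facts}
    return {s for s in subjects if all((s, p, o) in facts for p, o in patterns)}


print(evaluate_entity_query(QS, DS))
```

Only entities matching every pattern exactly are returned, which is why such a query finds nothing in a dataset that uses a different vocabulary.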

Page 10: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Heterogeneous Web Data

Schema-level: actors vs. starring

Data-level: Steven Spielberg vs. Spielberg, Steven

Varying number of attributes per entity
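To illustrate the data-level mismatch, a small Python sketch that reduces both spellings of a name to an unordered, lowercased token set. The tokens helper is an assumption made for illustration, not part of the presented system:

```python
import re


def tokens(value: str) -> frozenset[str]:
    """Lowercase a literal and keep its alphanumeric tokens, ignoring order."""
    return frozenset(re.findall(r"[a-z0-9]+", value.lower()))


# String equality fails, token-set equality holds.
print("Steven Spielberg" == "Spielberg, Steven")
print(tokens("Steven Spielberg") == tokens("Spielberg, Steven"))
```

Extra markers such as the "(I)" in "Spielberg, Steven (I)" add a token, so the sets are then related by inclusion rather than equality.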

[Figure: one example entity per dataset.
Amazon (a:): ea type a:Movie; a:Title "Munich"; a:Directors "Steven Spielberg"; a:Actors "Daniel Craig, Eric Bana"; a:Binding "DVD"; a:ReleaseDate 2005.
IMDb (i:): ei type i:movie; i:title "E.T. (1994)"; i:directors "Spielberg, Steven (I)"; i:producer "Spielberg, Steven (I)"; i:actors "Coyote, Peter".
DBpedia (db:): ed type db:Film; rdfs:label "1941 (film)"; db:director db:Steven_Spielberg; db:starring db:John_Candy_(actor).]

Page 11: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Aim: Integrate External Data into the Search Process

[Figure: the query qs over Ds should also reach the target datasets Dt. Two existing strategies:]

Keyword search: Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting based on up-front data integration: Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Page 12: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Existing Strategies – Keyword Search

Transform qs into keyword query

Match against bag-of-words representation of entities

Bridges schema differences by neglecting the structure

Baseline 1 (KW), IR baseline using Semplore (Lucene)
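The keyword baseline can be sketched as follows. This is a minimal Python illustration, assuming a plain token-overlap score in place of the Lucene-based scoring that Semplore uses; all helper names are hypothetical. The query and the two entities are the ones from the example on this slide:

```python
import re
from collections import Counter


def words(term: str) -> list[str]:
    """Drop a vocabulary prefix such as 'a:', split CamelCase, lowercase."""
    local = re.sub(r"^[A-Za-z]+:", "", term)
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def keyword_query(patterns):
    """Flatten (predicate, object) patterns into a list of keywords."""
    out = []
    for predicate, obj in patterns:
        out += words(predicate) + words(obj)
    return out


def bag_of_words(entity):
    """Represent an entity (attribute -> value) as a bag of words."""
    bag = Counter()
    for predicate, value in entity.items():
        bag.update(words(predicate) + words(value))
    return bag


def overlap_score(query_words, bag):
    """Assumed stand-in for IR scoring: count query words found in the bag."""
    return sum(1 for w in query_words if w in bag)


QS = [("a:Directors", "Rainer Werner Fassbinder"),
      ("a:TheatricalReleaseDate", "1982"),
      ("type", "a:Movie")]
E1 = {"i:title": "Veronika Voss", "i:director": "Rainer Werner Fassbinder",
      "i:released": "1982"}
E2 = {"i:title": "Schindlers Liste (1994)", "i:director": "Spielberg, Steven (I)",
      "type": "i:movie"}

query = keyword_query(QS)
print(" ".join(query))
print(overlap_score(query, bag_of_words(E1)), overlap_score(query, bag_of_words(E2)))
```

Because the structure is discarded, a token such as "1982" matches regardless of which attribute it came from, which is both how this baseline bridges schema differences and why it is imprecise.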

[Figure: keyword-search example. The Amazon (a:) query ?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982 becomes the keyword query "directors rainer werner fassbinder theatrical release date 1982 type movie". IMDb (i:) entities are represented as bags of words: e1 (i:title "Veronika Voss"; i:director "Rainer Werner Fassbinder"; i:released 1982) as "title veronika voss director rainer werner fassbinder released 1982", and e2 (type i:movie; i:title "Schindlers Liste (1994)"; i:director "Spielberg, Steven (I)") as "title schindlers liste 1994 director spielberg steven i type movie". The result list contains e1 and e2.]

Page 13: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Existing Strategies – Query Rewriting

Create mappings using ontology alignment tools (Falcon AO)

Rewrite query using the mappings, omit missing mappings, replace constants with variables

Reduces the search space; keyword search is performed on top

Baseline 2 (QR), database-style baseline
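The rewriting rules listed on this slide can be sketched in a few lines of Python. This is an illustration only: the mapping table mirrors the example mappings shown on the slide, treating "type" as mapped to itself is an assumption, and the function and variable names are hypothetical.

```python
# Example mappings as an alignment tool might produce them (Amazon -> DBpedia).
MAPPINGS = {
    "a:Directors": "db:director",
    "a:Title": "db:name",
    "a:Actor": "db:starring",
    "type": "type",  # assumption: the type predicate is shared by both datasets
}

# Seed query in the Amazon vocabulary; ?x is the implicit subject of each pattern.
QS = [
    ("type", "a:Movie"),
    ("a:Directors", "Rainer Werner Fassbinder"),
    ("a:TheatricalReleaseDate", "1982"),
]


def rewrite(patterns, mappings):
    """Map predicates, drop patterns without a mapping, replace constants by variables."""
    rewritten = []
    for predicate, _constant in patterns:
        target = mappings.get(predicate)
        if target is None:
            continue  # omit patterns whose predicate has no mapping
        rewritten.append((target, f"?v{len(rewritten)}"))
    return rewritten


print(rewrite(QS, MAPPINGS))
```

The rewritten query only constrains which attributes an entity must have, so it narrows the search space; the constants are then matched by the keyword search that runs on top.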

[Figure: query-rewriting example. An ontology alignment tool takes the Amazon schema and the DBpedia schema and produces mappings such as a:Directors = db:director, a:Title = db:name, a:Actor = db:starring. The Amazon (a:) query ?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982 is rewritten into the DBpedia (db:) query ?x db:director ?y; ?x type ?z.]

Page 14: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Page 15: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Contributions

(1) Novel approach for querying heterogeneous Web data sources

No upfront data integration necessary

Uses an Entity Relevance Model (ERM) for ranking and for computing mappings on the fly

(2) Implementation of the approach

Construction of an ERM and usage for alignment and ranking

Best-effort algorithm for creating mappings during runtime

(3) Large-scale evaluation with 3 real-world datasets

Experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.

Page 16: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Overview of our Approach

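[Figure: overview. qs is run against Ds to obtain the results Rs, from which the Entity Relevance Model is built (relevance feedback; a model leveraging the structure of the data). A keyword query retrieves candidate entities et from the target datasets Dt (keyword search to cross vocabulary mismatches). The candidates are then matched and ranked against the model.]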

Page 17: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Entity Relevance Model (ERM)

Based on the Structured Relevance Model (Lavrenko et al., 2007)

Entity Relevance Model:

Query specific model

Captures structure and content of relevant results

Composite model consisting of language models weighted by occurrence.

Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)
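A minimal Python sketch of such a composite model. The estimation choices are assumptions made for illustration: each attribute's language model is a maximum-likelihood term distribution without smoothing, and the occurrence weight is the fraction of result entities that carry the attribute. The function names are hypothetical; the two result entities are the ones used as the example for Rs on the next slide.

```python
import re
from collections import Counter, defaultdict


def tokenize(value: str) -> list[str]:
    """Lowercase a value and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", value.lower())


def build_erm(results):
    """Build {attribute: (language model, occurrence weight)} from result entities."""
    term_counts = defaultdict(Counter)
    occurrences = Counter()
    for entity in results:
        for attribute, value in entity.items():
            term_counts[attribute].update(tokenize(value))
            occurrences[attribute] += 1
    erm = {}
    for attribute, counter in term_counts.items():
        total = sum(counter.values())
        language_model = {w: c / total for w, c in counter.items()}
        erm[attribute] = (language_model, occurrences[attribute] / len(results))
    return erm


RS = [
    {"type": "Film", "director": "Rainer Werner Fassbinder",
     "label": "World on Wires", "released": "1973", "starring": "Klaus Löwitsch"},
    {"type": "Film", "director": "Rainer Werner Fassbinder",
     "label": "Veronika Voss", "released": "1982", "language": "German",
     "starring": "Barbara Valentin"},
]

erm = build_erm(RS)
print(erm["released"])
print(erm["language"])
```

In this toy result set, "language" occurs in only one of the two entities and therefore gets half the weight of "director".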

Page 18: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

ERM (2)

[Figure: from qs to the ERM via Rs = {e1, e2}. e1: type Film; director "Rainer Werner Fassbinder"; label "World on Wires"; released 1973; starring "Klaus Löwitsch". e2: type Film; director "Rainer Werner Fassbinder"; label "Veronika Voss"; released 1982; language "German"; starring "Barbara Valentin".]

Page 19: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Modelling Target Entities

Modeled the same way as ERM

Language Model for each attribute

[Figure: IMDb (i:) target entity ei: type i:movie; i:title "E.T. (1994)"; i:directors "Spielberg, Steven (I)"; i:producer "Spielberg, Steven (I)"; i:actors "Coyote, Peter".]

Page 20: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Ranking

Idea:

Rank candidate entities according to their similarity to ERM

Note: an alignment between the ERM and et is needed

If no mapping is available, use the maximum H.

[Formula not preserved in this transcript. Its annotated factors: boosting of seed query attributes, frequency of as, cross entropy]
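Since the ranking formula itself is not preserved here, the following Python sketch is only one plausible reading of the three annotated factors, not the authors' exact function: for every ERM field as, a boost (b for seed query attributes, otherwise 1) times the field's frequency times the cross entropy H to the mapped attribute of et, with a maximum H when no mapping exists. The probability floor EPSILON, the resulting MAX_H, the default b = 2.0 and all names are assumptions.

```python
import math

EPSILON = 1e-6               # assumed probability floor for unseen words
MAX_H = -math.log(EPSILON)   # cross entropy when no word is shared or no mapping exists


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def distance(erm, entity_lms, mapping, seed_attributes, b=2.0):
    """Weighted sum of cross entropies; smaller means more similar to the ERM."""
    total = 0.0
    for a_s, (lm_s, frequency) in erm.items():
        a_t = mapping.get(a_s)
        h = cross_entropy(lm_s, entity_lms[a_t]) if a_t in entity_lms else MAX_H
        boost = b if a_s in seed_attributes else 1.0
        total += boost * frequency * h
    return total


def rank(erm, candidates, mapping, seed_attributes, b=2.0):
    """Order candidate entity names by increasing distance to the ERM."""
    return sorted(candidates,
                  key=lambda name: distance(erm, candidates[name], mapping,
                                            seed_attributes, b))


THIRD = 1 / 3
ERM = {"director": ({"rainer": THIRD, "werner": THIRD, "fassbinder": THIRD}, 1.0),
       "released": ({"1982": 1.0}, 1.0)}
CANDIDATES = {
    "e2": {"i:director": {"spielberg": THIRD, "steven": THIRD, "i": THIRD}},
    "e1": {"i:director": {"rainer": THIRD, "werner": THIRD, "fassbinder": THIRD},
           "i:released": {"1982": 1.0}},
}
MAPPING = {"director": "i:director", "released": "i:released"}

print(rank(ERM, CANDIDATES, MAPPING, seed_attributes={"director"}))
```

With a boost above 1, a mismatch on an attribute that appears in the seed query costs more than a mismatch on an attribute that only appears in the results.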

Page 21: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

On The Fly Alignment

as ~ at ??

Idea:

Compare all language models of et to a field of the ERM using cross entropy H.

Establish a mapping if the lowest value of H is below a threshold t.

Worst case: n × r comparisons

n and r are usually small

Allows reuse of computed cross entropies for subsequent ranking
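The alignment idea can be sketched in Python as follows. This is a hedged illustration: the probability floor EPSILON, the threshold values and all names are assumptions, and the toy language models are made up for the example.

```python
import math

EPSILON = 1e-6  # assumed probability floor for unseen words


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log q(w), with q floored at EPSILON."""
    return -sum(pw * math.log(max(q.get(w, 0.0), EPSILON)) for w, pw in p.items())


def align(erm_lms, entity_lms, t):
    """Map each ERM field to the entity attribute with the lowest H, if H < t.

    Returns the mapping and a cache of all computed cross entropies,
    so that ranking can reuse them.
    """
    mapping, cache = {}, {}
    for a_s, lm_s in erm_lms.items():
        for a_t, lm_t in entity_lms.items():
            cache[(a_s, a_t)] = cross_entropy(lm_s, lm_t)
        best = min(entity_lms, key=lambda a_t: cache[(a_s, a_t)], default=None)
        if best is not None and cache[(a_s, best)] < t:
            mapping[a_s] = best
    return mapping, cache


THIRD = 1 / 3
ERM_LMS = {"director": {"rainer": THIRD, "werner": THIRD, "fassbinder": THIRD},
           "released": {"1982": 1.0}}
ENTITY_LMS = {"i:director": {"fassbinder": THIRD, "rainer": THIRD, "werner": THIRD},
              "i:title": {"veronika": 0.5, "voss": 0.5},
              "i:year": {"1982": 1.0}}

mapping, cache = align(ERM_LMS, ENTITY_LMS, t=2.0)
print(mapping)
print(len(cache))
```

The nested loop makes the worst-case cost explicit: one cross entropy per pair of ERM field and entity attribute. Lowering t makes the alignment stricter; in this toy example, t = 1.0 drops the director mapping because its cross entropy is ln 3.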

Page 22: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

EXPERIMENTS

Page 23: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Datasets

Three real-world, heterogeneous Web datasets:

(1) DBpedia 3.5.1, structured representation of Wikipedia

(2) IMdb, information about movies

(3) Amazon, information about DVD/Videos

(2) and (3) were crawled and transformed to RDF; provided by L3S

Page 24: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Ground Truth

Goal is to find relevant entities in the target datasets

The seed query qs is rewritten manually to obtain the relevant entities in the target datasets.

3 query sets, each with 23 corresponding SPARQL BGP entity queries

[Figure: one seed query written manually for each dataset.
Amazon (a:): ?x type a:Movie; a:Directors "Rainer Werner Fassbinder"; a:TheatricalReleaseDate 1982.
DBpedia (db:): ?x type db:Film; db:director db:Rainer_Werner_Fassbinder; db:released 1982.
IMDb (i:): ?x type i:movie; i:directors "Fassbinder, Rainer Werner"; i:year 1982.]

Page 25: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

IR Experiments

Baseline KW – Keyword Search

Baseline QR – Query Rewriting

Three configurations of ERM:

ERM – computes alignments on the fly

ERMa – uses pre-computed alignments only

ERMq – uses pre-computed alignments and creates mappings on top

Six different retrieval settings.

Page 26: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Results (1) – Mean Average Precision

ERM improves over KW by 120% and over QR by 54%

ERMa performs slightly better than ERM

ERMq performs best.
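For readers unfamiliar with the metric, a short Python sketch of how Mean Average Precision is commonly computed. The rankings and relevance sets below are toy values for illustration, not data from the experiments, and the function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average the per-query average precision over all (ranking, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy example with two queries.
RUNS = [
    (["a", "x", "b"], {"a", "b"}),  # relevant items at ranks 1 and 3
    (["x", "a"], {"a"}),            # relevant item at rank 2
]
print(mean_average_precision(RUNS))
```

An improvement of 120% in MAP means the score is 2.2 times that of the baseline.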

Page 27: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Results (2) – On The Fly Alignment

Pooled mappings for n = 115k entities

Average Precision = 0.7, Average Recall = 0.3 for relevant entities

Pearson correlation ρ(MAP, Precision-Rel) = 0.98

Page 28: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Results (3) – Parameter and Runtime Analysis

Analysis of the model parameters

Sensitivity of retrieval performance in terms of MAP for varying parameter configurations

Runtime analysis

Execution takes less than 13s on average

Can be improved by moving tasks (e.g. computation of language models) to index time.

Page 29: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Conclusion

Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds.

Entity Relevance Model used for ranking and creating mappings during runtime.

Experiments showed that our approach is effective and exceeds the baselines substantially.

Page 30: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMdb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).

Baseline Keyword Search

Baseline Query Rewriting

Overview

Scenario

Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Daniel M. Herzig, Thanh Tran

Institute AIFB, Karlsruhe Institute of Technology, Germany

THANK YOU!

Page 31: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Execution Process of our Approach

[Figure: same components as the overview: qs, Ds, Rs, Entity Relevance Model, target datasets Dt, candidate entities et]

Run qs against Ds to obtain results Rs

Build ERM from Rs

Obtain candidate entities et

Compare et to ERM

Rank et according to similarity to ERM

Page 32: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Runtime Analysis

Average execution time less than 13 sec for the parameter setting used in the IR experiments.

Increasing parameter c (i.e. reducing the number of fields of the ERM) improves runtime performance

Our implementation performed some tasks at runtime, which can be moved to index time

Improvements are easily possible

Page 33: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Parameter Analysis

Model is robust in certain parameter ranges

Boosting b: beneficial for similar datasets, not for diverse ones

Pruning c: small effect on effectiveness, larger effect on efficiency

Page 34: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Boosting Parameter b

If attribute as is present in the seed query, the boosting parameter is set to b, in order to increase its influence during ranking.

Page 35: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Alignment

[Figure: the language models (probability distributions) of the ERM and of et are compared by cross entropy]
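For reference, the standard definition of the cross entropy between two language models over words w. The direction shown here (the ERM field as the reference distribution, the attribute of et as the estimate) and the subscript notation are assumptions; the slides do not spell the formula out.

```latex
H\left(P_{a_s}, P_{a_t}\right) = -\sum_{w} P_{a_s}(w)\,\log P_{a_t}(w)
```

H is smallest when the two distributions agree and grows as the attribute of et assigns low probability to words that are frequent in the ERM field. In practice the estimate has to be smoothed so that no word has probability zero.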

Page 36: Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

Related Work (excerpt)

Keyword search: Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting: Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Our approach is based on:

Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)

Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)
