Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

Daniel M. Herzig, Thanh Tran

WWW 2012

2 WWW 2012

Agenda

Motivation

Problem Definition

Existing Solutions

Our Approach

Entity Relevance Model (ERM)

Ranking

On-The-Fly alignment

Experiments

Conclusion

Daniel M. Herzig - Institute AIFB

Company running a movie shopping website

Movies Shopping Website

Company’s dataset

Users search the website via forms. Search request is internally executed as a structured query

Screenshot of http://www.imdb.com/search/title

Structured Query (e.g. SQL, SPARQL)

Example query (IMDb, prefix i:): i:movie; i:directors “Steven Spielberg”; i:year

Company discovers the plethora of Linked Data available on the Web and identifies Data Sources beneficial for its business

Linked Data on the Web

http://richard.cyganiak.de/2007/10/lod/

Zero Star Mugs!

Problems of Data Integration arise…

qs does not return results

No links, no integration

No knowledge about the external data schema

External data might change often

Problem Definition

Find relevant entities in a set of target datasets Dt, given a source dataset Ds and a structured entity query qs adhering to the vocabulary of Ds.

Ds: Source Dataset

Dt1, Dt2: Target Datasets

Structured entity query

Problem Setting

Data model is a labeled directed graph

Directly related to RDF

RDF specifics, e.g. blank nodes, are omitted

Entity query: SPARQL BGP query with one select variable

Entity queries are the most frequent type of web search queries (Pound et al., WWW 2010)

Web Data scenario: data exhibits heterogeneity on the schema level and the data level
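To make the setting concrete, the following minimal sketch represents a dataset as a list of (subject, predicate, object) triples and evaluates a star-shaped entity query, i.e. a basic graph pattern with a single select variable. The function name `evaluate_entity_query`, the identifiers `i:m1` and `i:m2`, and the sample triples are illustrative assumptions; the slides do not specify an implementation.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def evaluate_entity_query(triples: Iterable[Triple],
                          pattern: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return all subjects that match every (predicate, object) pair.

    This corresponds to a star-shaped BGP such as
        SELECT ?x WHERE { ?x p1 o1 . ?x p2 o2 }
    """
    index: Dict[str, Set[Tuple[str, str]]] = {}
    for s, p, o in triples:
        index.setdefault(s, set()).add((p, o))
    wanted = set(pattern)
    return {s for s, po in index.items() if wanted <= po}


# Hypothetical sample data in an IMDb-like vocabulary (prefix i:).
data = [
    ("i:m1", "rdf:type", "i:movie"),
    ("i:m1", "i:directors", "Spielberg, Steven (I)"),
    ("i:m2", "rdf:type", "i:movie"),
    ("i:m2", "i:directors", "Fassbinder, Rainer Werner"),
]
query = [("rdf:type", "i:movie"), ("i:directors", "Spielberg, Steven (I)")]
result = evaluate_entity_query(data, query)
```

The same query would return nothing on a dataset that writes the director as “Steven Spielberg” or names the attribute differently, which is the heterogeneity problem addressed here.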

Heterogeneous Web Data

Schema-level: actors vs. starring

Data-level: Steven Spielberg vs. Spielberg, Steven

Varying number of attributes per entity

Example entities (figure):

Amazon (a:): a:Movie; a:Title “Munich”; a:Directors “Steven Spielberg”; a:Actors “Daniel Craig, Eric Bana”; a:Binding; a:ReleaseDate

IMDb (i:): i:movie; i:title “E.T. (1994)”; i:directors “Spielberg, Steven (I)”; i:actors “Coyote, Peter”; i:producer

DBpedia (db:): type db:Film; rdfs:label “1941 (film)”; db:director db:Steven_Spielberg; db:starring db:John_Candy_(actor)

Aim: Integrate External Data into the Search Process

Keyword Search

Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting based on up-front data integration

Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Existing Strategies – Keyword Search

Transform qs into keyword query

Match against bag-of-words representation of entities

Bridges schema differences by neglecting the structure

Baseline 1 (KW), IR baseline using Semplore (Lucene)

Example (figure): Amazon (a:) query matched against IMDb (i:) entities

Query: a:Movie; a:Directors “Rainer Werner Fassbinder”; a:TheatricalReleaseDate

Keyword query: directors rainer werner fassbinder theatrical release date 1982 type movie

IMDb entity: i:title “Veronika Voss”; i:director “Rainer Werner Fassbinder”; i:released

Bag of words: title veronika voss director rainer werner fassbinder released

IMDb entity: i:movie; i:title “Schindlers Liste (1994)”; i:director

Bag of words: title schindlers liste 1994 director spielberg steven i type movie
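The keyword baseline can be illustrated with a small sketch that flattens attribute names and values into a bag of words and compares the query to each entity. The helper names (`tokenize`, `bag_of_words`, `overlap_score`) and the simple term-overlap score are assumptions for illustration only; the actual baseline uses Semplore on top of Lucene, whose scoring is not reproduced here.

```python
import re
from typing import Dict, Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase a label or value and split it into word tokens.

    A leading namespace prefix such as "a:" is dropped and camel-case
    names such as "TheatricalReleaseDate" are split into separate words.
    """
    text = re.sub(r"^\w+:", "", text)                  # drop prefix like "a:"
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)   # split camelCase
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(entity: Dict[str, str]) -> List[str]:
    """Flatten attribute names and values into one token list."""
    tokens: List[str] = []
    for attribute, value in entity.items():
        tokens += tokenize(attribute) + tokenize(value)
    return tokens


def overlap_score(query_tokens: Iterable[str], entity_tokens: Iterable[str]) -> int:
    """Toy score: number of distinct query terms found in the entity."""
    return len(set(query_tokens) & set(entity_tokens))


query = {
    "a:Directors": "Rainer Werner Fassbinder",
    "a:TheatricalReleaseDate": "1982",
    "rdf:type": "a:Movie",
}
voss = {"i:title": "Veronika Voss", "i:director": "Rainer Werner Fassbinder"}
liste = {"i:title": "Schindlers Liste (1994)",
         "i:director": "Spielberg, Steven (I)",
         "rdf:type": "i:movie"}

q = bag_of_words(query)
scores = {"voss": overlap_score(q, bag_of_words(voss)),
          "liste": overlap_score(q, bag_of_words(liste))}
```

Because structure is ignored, the unrelated entity still scores on the generic terms "type" and "movie", and "directors" does not match "director" without stemming.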

Existing Strategies – Query Rewriting

Create mappings using ontology alignment tools (Falcon AO)

Rewrite query using the mappings, omit missing mappings, replace constants with variables

Reduces the search space, perform keyword search on top

Baseline 2 (QR), database-style baseline

Figure: Schema Amazon and Schema DBpedia are fed into the Ontology Alignment Tool, which outputs mappings between Amazon (a:) and DBpedia (db:):

a:Directors = db:director

a:Title = db:name

a:Actor = db:starring

… = …

Example: the Amazon query a:Movie, a:Directors “Rainer Werner Fassbinder” is rewritten to use db:director.
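The rewriting step of this baseline can be sketched as follows, assuming a query is a list of (attribute, constant) pairs on one entity variable and the mappings come from an alignment tool. The function names and the `a:Binding` constant "DVD" are hypothetical; only the two mappings are taken from the slide.

```python
from typing import Dict, List, Optional, Tuple

Query = List[Tuple[str, str]]                 # (attribute, constant)
Rewritten = List[Tuple[str, Optional[str]]]   # (attribute, None = variable)


def rewrite_query(query: Query, mappings: Dict[str, str]) -> Rewritten:
    """Translate source attributes into the target vocabulary.

    Attributes without a mapping are omitted, and constants are replaced
    with variables (represented as None), so the rewritten query only
    restricts which attributes a candidate entity must have.
    """
    rewritten: Rewritten = []
    for attribute, _constant in query:
        target = mappings.get(attribute)
        if target is not None:
            rewritten.append((target, None))
    return rewritten


def to_sparql(rewritten: Rewritten) -> str:
    """Render the rewritten pattern as a SPARQL query with one select variable."""
    lines = [f"  ?x {attr} ?v{i} ." for i, (attr, _) in enumerate(rewritten)]
    return "SELECT ?x WHERE {\n" + "\n".join(lines) + "\n}"


mappings = {"a:Directors": "db:director", "a:Title": "db:name"}
source_query = [("a:Directors", "Rainer Werner Fassbinder"),
                ("a:Binding", "DVD")]  # no mapping exists for a:Binding

rewritten = rewrite_query(source_query, mappings)
sparql = to_sparql(rewritten)
```

The rewritten query only narrows the search space; the keyword search performed on top is still needed to bring the constants back into play.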

Contributions

(1) Novel approach for querying heterogeneous Web data sources

No upfront data integration necessary

Uses an Entity Relevance Model (ERM) for ranking and for computing mappings on the fly

(2) Implementation of the approach

Construction of an ERM and usage for alignment and ranking

Best-effort algorithm for creating mappings during runtime

(3) Large-scale evaluation with 3 real-world datasets

Experiments show that our approach exceeds the KW and QR baselines by 120% and 54%, respectively, in terms of Mean Average Precision.

Overview of our Approach

Rs

Entity Relevance Model

et

Keyword search to cross vocabulary mismatches

Relevance Feedback

keyword query

Model leveraging the structure of the data

Matching and Ranking

Entity Relevance Model (ERM)

Based on the Structured Relevance Model (Lavrenko et al., 2007)

Entity Relevance Model:

Query specific model

Captures structure and content of relevant results

Composite model consisting of language models weighted by occurrence.

Based on Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)

ERM (2)

Example (figure): qs, Rs = {e1, e2}, ERM

qs: director “Rainer Werner Fassbinder”

e1: “World on Wires”; released 1973; starring Klaus Löwitsch; type; director

e2: “Veronika Voss”; released 1982; language German; starring Barbara Valentin
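As a rough illustration of a composite model built from the result set Rs, the sketch below creates one unigram language model per attribute and weights it by how often the attribute occurs in Rs. The maximum-likelihood estimate, the whitespace tokenization, the exact weighting, and the name `build_erm` are assumptions made for this sketch; the slides only state that the ERM consists of language models weighted by occurrence.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Dict[str, str]        # attribute -> textual value
LM = Dict[str, float]          # term -> probability


def build_erm(results: List[Entity]) -> Dict[str, Tuple[float, LM]]:
    """Build attribute -> (weight, language model) from the results Rs.

    weight: fraction of result entities carrying the attribute (assumption).
    language model: maximum-likelihood unigram distribution over all
    values of that attribute in Rs (assumption).
    """
    term_counts: Dict[str, Counter] = {}
    occurrences: Counter = Counter()
    for entity in results:
        for attribute, value in entity.items():
            term_counts.setdefault(attribute, Counter()).update(value.lower().split())
            occurrences[attribute] += 1

    erm: Dict[str, Tuple[float, LM]] = {}
    for attribute, counts in term_counts.items():
        total = sum(counts.values())
        model = {term: n / total for term, n in counts.items()}
        erm[attribute] = (occurrences[attribute] / len(results), model)
    return erm


# The two result entities from the slide, with attribute values as text.
e1 = {"director": "Rainer Werner Fassbinder", "released": "1973",
      "starring": "Klaus Löwitsch"}
e2 = {"director": "Rainer Werner Fassbinder", "released": "1982",
      "language": "German", "starring": "Barbara Valentin"}

erm = build_erm([e1, e2])
```

In this example the field "language" receives weight 0.5 because only e2 carries it, while "director" receives weight 1.0.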

Modelling Target Entities

Modeled the same way as ERM

Language Model for each attribute

Example (figure), IMDb (i:): type i:movie; i:directors; i:title “E.T. (1994)”; i:actors “Coyote, Peter”; i:producer

Ranking

Rank candidate entities according to their similarity to ERM

Note: Alignment between ERM and et needed

If no mapping is available, use max H.

Score components (the formula is shown as a figure on the slide): boosting of seed query attributes, frequency of as, cross entropy H
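One possible way to combine the three named components is sketched below: for every ERM field, the cross entropy to the aligned language model of et is multiplied by the field's frequency and by a boost b if the field appears in the seed query, and fields without a mapping contribute the maximum cross entropy. The exact combination, the probability floor used as smoothing, and the names `cross_entropy` and `distance` are assumptions; the formula on the slide is not recoverable from the text.

```python
import math
from typing import Dict, Set, Tuple

LM = Dict[str, float]  # term -> probability


def cross_entropy(p: LM, q: LM, floor: float = 1e-6) -> float:
    """H(p, q) = -sum_w p(w) log q(w); terms unseen in q get a floor (assumption)."""
    return -sum(pw * math.log(q.get(w, floor)) for w, pw in p.items())


def distance(erm: Dict[str, Tuple[float, LM]],
             entity: Dict[str, LM],
             mapping: Dict[str, str],
             seed_attributes: Set[str],
             b: float = 2.0,
             floor: float = 1e-6) -> float:
    """Lower values mean the candidate entity is more similar to the ERM."""
    max_h = -math.log(floor)  # largest value cross_entropy can return here
    total = 0.0
    for a_s, (weight, lm_s) in erm.items():
        a_t = mapping.get(a_s)
        h = cross_entropy(lm_s, entity[a_t], floor) if a_t in entity else max_h
        boost = b if a_s in seed_attributes else 1.0
        total += boost * weight * h
    return total


erm = {
    "director": (1.0, {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3}),
    "released": (1.0, {"1973": 0.5, "1982": 0.5}),
}
candidates = {
    "voss": {"i:directors": {"fassbinder": 1 / 3, "rainer": 1 / 3, "werner": 1 / 3},
             "i:year": {"1982": 1.0}},
    "et": {"i:directors": {"spielberg": 0.5, "steven": 0.5},
           "i:year": {"1982": 1.0}},
}
mapping = {"director": "i:directors", "released": "i:year"}

ranking = sorted(
    candidates,
    key=lambda name: distance(erm, candidates[name], mapping, {"director"}),
)
```

With these toy values the Fassbinder film is ranked ahead of the Spielberg film, since its director field has a much lower cross entropy to the ERM.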

On The Fly Alignment

as ~ at ??

Compare all language models of et to a field of ERM using cross entropy H.

Establish a mapping if the lowest value of H is below a threshold t.

Worst case: n · r comparisons; n and r are usually small

Allows reuse of computed cross entropies for subsequent ranking
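The alignment procedure can be sketched as a nested loop over ERM fields and the language models of a candidate entity, keeping every computed cross entropy in a cache so that ranking can reuse it. The function name `align`, the probability floor for unseen terms, and the threshold value t = 5.0 are assumptions chosen for the toy data.

```python
import math
from typing import Dict, Tuple

LM = Dict[str, float]  # term -> probability


def cross_entropy(p: LM, q: LM, floor: float = 1e-6) -> float:
    """H(p, q) = -sum_w p(w) log q(w); terms unseen in q get a floor (assumption)."""
    return -sum(pw * math.log(q.get(w, floor)) for w, pw in p.items())


def align(erm_fields: Dict[str, LM], entity: Dict[str, LM], t: float
          ) -> Tuple[Dict[str, str], Dict[Tuple[str, str], float]]:
    """Map each ERM field to the entity attribute with the lowest H, if H < t.

    Returns the mapping and a cache of all computed cross entropies,
    which holds at most n * r values for n fields and r attributes.
    """
    mapping: Dict[str, str] = {}
    cache: Dict[Tuple[str, str], float] = {}
    for a_s, lm_s in erm_fields.items():
        best_attr, best_h = None, math.inf
        for a_t, lm_t in entity.items():
            h = cross_entropy(lm_s, lm_t)
            cache[(a_s, a_t)] = h
            if h < best_h:
                best_attr, best_h = a_t, h
        if best_attr is not None and best_h < t:
            mapping[a_s] = best_attr
    return mapping, cache


erm_fields = {
    "director": {"rainer": 1 / 3, "werner": 1 / 3, "fassbinder": 1 / 3},
    "released": {"1982": 1.0},
    "language": {"german": 1.0},
}
entity = {
    "i:directors": {"fassbinder": 0.5, "rainer": 0.25, "werner": 0.25},
    "i:year": {"1982": 1.0},
    "i:title": {"veronika": 0.5, "voss": 0.5},
}
mapping, cache = align(erm_fields, entity, t=5.0)
```

Here "language" stays unmapped because none of the entity's language models shares a term with it, so its lowest cross entropy exceeds the threshold.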

EXPERIMENTS

Datasets

Three real-world, heterogeneous Web datasets:

(1) DBpedia 3.5.1, structured representation of Wikipedia

(2) IMDb, information about movies

(3) Amazon, information about DVD/Videos

Datasets (2) and (3) were crawled and transformed to RDF; they were provided by L3S.

Ground Truth

Goal is to find relevant entities in the target datasets

The seed query qs was manually rewritten for each target dataset to obtain the relevant entities.

3 query sets each with 23 corresponding entity BGP SPARQL queries

Example (figure): one query expressed in each vocabulary

Amazon (a:): a:Movie; a:Directors “Rainer Werner Fassbinder”

DBpedia (db:): db:Film; db:director db:Rainer_Werner_Fassbinder; db:released

IMDb (i:): i:movie; i:directors “Fassbinder, Rainer Werner”; i:year

IR Experiments

Baseline KW – Keyword Search

Baseline QR – Query Rewriting

Three configurations of ERM:

ERM – computes alignments on the fly

ERMa – uses pre-computed alignments only

ERMq – uses pre-computed alignments and creates mappings on top

Six different retrieval settings.

Results (1) – Mean Average Precision

ERM improves over KW by 120% and over QR by 54%

ERMa performs slightly better than ERM

ERMq performs best.
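For readers unfamiliar with the metric, Mean Average Precision can be computed as in the sketch below, which follows the standard textbook definition. The slides do not state evaluation details such as a rank cutoff, so this is not necessarily the exact evaluation procedure used; the rankings and relevance sets are made-up toy data.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the ranks of the relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries: (ranked result list, set of relevant entities).
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),   # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                  # AP = (1/2) / 1
]
map_value = mean_average_precision(runs)
```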

Results (2) – On The Fly Alignment

Pooled mappings for n = 115k entities

Average Precision = 0.7, Average Recall = 0.3 for relevant entities

Pearson correlation ρ(MAP, Precision-Rel) = 0.98

Results (3) – Parameter and Runtime Analysis

Analysis of the model parameters

Sensitivity of retrieval performance in terms of MAP for varying parameter configurations

Runtime analysis

Execution takes less than 13 s on average

Can be improved by moving tasks (e.g. computation of language models) to index time.

Conclusion

Novel approach for searching entities in a target dataset Dt with a structured query qs adhering to the vocabulary of Ds.

Entity Relevance Model used for ranking and creating mappings during runtime.

Experiments showed that our approach is effective and exceeds the baselines substantially.

ACKNOWLEDGEMENTS: We thank our colleagues Philipp Sorg and Günter Ladwig for helpful discussions. Also, we thank Julien Gaugaz and the L3S Research Center for providing us their versions of the IMDb and Amazon datasets. This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).

Baseline Keyword Search

Baseline Query Rewriting

Overview

Scenario

Daniel M. Herzig, Thanh Tran

herzig@kit.edu

Institute AIFB, Karlsruhe Institute of Technology, Germany

THANK YOU!

Execution Process of our Approach

Entity Relevance

Run qs against Ds to obtain results Rs

Build ERM from Rs

Obtain candidate entities et

Compare et to ERM

Rank et according to similarity to ERM
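The steps above can be tied together in a short orchestration sketch. All four callables passed to `search_target` are hypothetical placeholders for the components described on the earlier slides, and the stub implementations below use plain word overlap only so that the example runs on its own.

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")  # entity representation
M = TypeVar("M")  # relevance model representation


def search_target(qs: str,
                  run_on_source: Callable[[str], Sequence[E]],
                  build_model: Callable[[Sequence[E]], M],
                  get_candidates: Callable[[str], Sequence[E]],
                  distance: Callable[[M, E], float]) -> List[E]:
    """Run the five steps: query Ds, build the ERM, fetch et, compare, rank."""
    results = run_on_source(qs)          # Rs: results of qs on Ds
    model = build_model(results)         # ERM built from Rs
    candidates = get_candidates(qs)      # candidate entities et
    # Compare each et to the ERM and sort by increasing distance.
    return sorted(candidates, key=lambda e_t: distance(model, e_t))


# Stub components (assumptions): entities are plain strings and the
# "model" is just the set of words seen in the source results.
source = {"fassbinder": ["veronika voss fassbinder 1982"]}
ranked = search_target(
    "fassbinder",
    run_on_source=lambda q: source.get(q, []),
    build_model=lambda rs: set(" ".join(rs).split()),
    get_candidates=lambda q: ["et spielberg 1982", "voss fassbinder 1982"],
    distance=lambda model, e: -len(model & set(e.split())),
)
```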

Runtime Analysis

Average execution time less than 13 sec for the parameter setting used in the IR experiments.

Increasing parameter c (i.e. reducing the number of fields of the ERM) improves runtime performance

Our implementation performed some tasks at runtime, which can be moved to index time

Improvements are easily possible

Parameter Analysis

Model is robust in certain parameter ranges

Boosting b: Beneficial for similar datasets, not so for diverse

Pruning c: Small effect on effectiveness, larger on efficiency

Boosting Parameter b

If attribute as is present in the seed query, the boosting parameter is set to b, in order to increase its influence during ranking.

Alignment

Compare language models (probability distributions) by cross entropy
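For reference, the cross entropy between two language models P and Q over a vocabulary V is commonly defined as below. The symbols P, Q and V are notation chosen here, and the slides do not show how terms with Q(w) = 0 are smoothed.

```latex
H(P, Q) = -\sum_{w \in V} P(w) \, \log Q(w)
```

H reaches its minimum, the entropy of P, when Q equals P, and grows as Q assigns less probability to the terms that are likely under P.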

Related Work (excerpt)

Keyword Search

Wang et al.: Semplore: A scalable IR approach to search the Web of Data. In: Journal of Web Semantics. (2009)

Query rewriting

Calì et al.: Query Rewriting and Answering under Constraints in Data Integration Systems. In: IJCAI. (2003)

Our approach is based on

Lavrenko et al.: Information Retrieval on Empty Fields. In: HLT-NAACL. (2007)

Madhavan et al.: Web-scale Data Integration: You can afford to pay as you go. In: CIDR. (2007)
