Ralf Schenkel joint work with Jens Graupmann and Gerhard Weikum The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents



VLDB 2005, Trondheim, Norway 2

Outline

• Where existing search engines fail

• SphereSearch Concepts

• Transformation and Annotation

• Query Language and Scoring

• Experimental Evaluation

• Summary


Example query #1

Which professors from Saarbrücken do research on XML?

Different terminology in query and Web pages

Director of Department 5 DBS & IS

Professor at Saarland University

Abstraction Awareness


Example query #2

Conferences about XML in Norway 2005?

Context Awareness

Information is not present on a single page, but distributed across linked pages

VLDB Conference 2005, Trondheim, Norway

Call for Papers…XML…


Example query #3

What are the publications of Max Planck?

Max Planck should be an instance of the concept person, not of the concept institute

Concept Awareness


SphereSearch Concepts

• Unified search for unstructured, semistructured, structured data from heterogeneous sources

• Graph-based model, including links

• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries

• Flexible yet simple abstraction-aware query language with context-aware scoring

• Compactness-based scores

Goal: Increase recall & precision for hard queries on linked and heterogeneous data


Some Related Work

• Web Query Languages, e.g., W3QS [VLDB95], WebOQL [ICDE95], …

• Web IR with thesauri, e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], …

• XML IR, e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], …

• Information extraction, e.g., Lixto, KnowItAll, …

• Advanced Web graph IR, e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], …


Outline

• Where existing search engines fail

• SphereSearch Concepts

• Transformation and Annotation

• Query Language and Scoring

• Experimental Evaluation

• Current and Future Work


Unifying Search on Heterogeneous Data

[Figure: heterogeneous sources (Web, Intranet, Databases, Enterprise Information Systems) are converted to XML by heuristics and type-specific transformations]


Heuristic Transformation of HTML

Goal: Transform layout tags to semantic annotations

• Headlines
<h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system...
→ <Experiments><Settings>...</Settings><Results>...</Results></Experiments>

• Patterns
<b>Topic:</b>XML → <Topic>XML</Topic>

• Rules for tables, lists, …
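The headline rule can be illustrated with a small sketch. This is not the SphereSearch implementation; it is a minimal Python approximation that assumes well-formed <h1> to <h6> tags without attributes, and the function name headlines_to_xml is hypothetical. Each headline opens an element named after its text, and a headline of the same or a higher level closes the sections opened before it.

```python
import re
from xml.sax.saxutils import escape

HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)

def headlines_to_xml(html: str) -> str:
    """Nest the text between headlines into elements named after the headlines."""
    out = []
    stack = []  # (level, element name) of the sections that are currently open
    pos = 0
    for match in HEADING.finditer(html):
        text = html[pos:match.start()].strip()
        if text:
            out.append(escape(text))
        level = int(match.group(1))
        # A new headline closes all open sections of the same or a deeper level.
        while stack and stack[-1][0] >= level:
            out.append(f"</{stack.pop()[1]}>")
        # Simplification: derive an element name from the headline text.
        name = re.sub(r"\W+", "_", match.group(2)).strip("_") or "section"
        if name[0].isdigit():
            name = "_" + name
        out.append(f"<{name}>")
        stack.append((level, name))
        pos = match.end()
    tail = html[pos:].strip()
    if tail:
        out.append(escape(tail))
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)

page = ("<h1>Experiments</h1><h2>Settings</h2>We evaluated..."
        "<h2>Results</h2>Our system...")
print(headlines_to_xml(page))
```

Rules for patterns, tables and lists would need their own heuristics and are not covered by this sketch.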


(Almost) Generic XML Data Model

<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research></Professor>

Node 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"

Node 2: docid=1, tag="Course", content="IR"

Node 3: docid=1, tag="Research", content="XML"

Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction

Tags annotate content with corresponding concept (e.g., person, location)
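A minimal Python sketch of this node model, assuming that a node's content is the element's own text without the text of its child elements and that nodes are numbered in document order. The class name Node and the function xml_to_nodes are illustrative and not taken from SphereSearch.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    node_id: int
    docid: int
    tag: str
    content: str
    parent: Optional[int]

def xml_to_nodes(xml_text: str, docid: int) -> List[Node]:
    """Create one node per element, in document (pre-order) order."""
    nodes: List[Node] = []

    def visit(elem: ET.Element, parent_id: Optional[int]) -> None:
        node_id = len(nodes) + 1
        # Own text = text before the first child plus the tails after each child.
        parts = [elem.text or ""] + [child.tail or "" for child in elem]
        content = " ".join(" ".join(parts).split())
        nodes.append(Node(node_id, docid, elem.tag, content, parent_id))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return nodes

doc = ("<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken "
       "<Research> XML </Research></Professor>")
for node in xml_to_nodes(doc, docid=1):
    print(node)
```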


Information Extraction (IE)

The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.

The <company> Pelican Hotel </company> in <location> Salvador </location>, operated by <person> Roberto Cardoso </person>, offers comfortable rooms starting at <price> $100 </price> a night, including breakfast. Please check in before <time> 7pm </time>.

• Named Entity Recognition (NER)

• Named Entity ~ abstract datatype, concept (location, person, …, IP-address)

• Mature (out-of-the-box products, e.g. GATE/ANNIE)

• Extensible


Unifying Search on Heterogeneous Data

[Figure: heterogeneous sources (Web, Intranet, Databases, Enterprise Information Systems) are converted to XML by heuristics and type-specific transformations, then to annotated XML by annotating named entities with IE tools (e.g., GATE)]


Annotation-Aware Data Model

<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor>

Before annotation:

Node 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"

Node 2: docid=1, tag="Course", content="IR"

Node 3: docid=1, tag="Research", content="XML"

Annotation with GATE: "Saarbrücken" of type "location"

After annotation:

Node 1: docid=1, tag="Professor", content="Gerhard Weikum"

Node 2: docid=1, tag="Course", content="IR"

Node 3: docid=1, tag="Research", content="XML"

Node 4: docid=1, tag="location", content="Saarbrücken"

Annotation introduces new tags
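The effect of annotation on the node model can be sketched as follows. The slides use GATE for entity recognition; the sketch below replaces it with a hypothetical dictionary lookup (gazetteer) purely for illustration, and the function name annotate and the dictionary-based node layout are assumptions.

```python
def annotate(nodes, gazetteer):
    """Move recognized phrases out of a node's content into new child nodes.

    `gazetteer` maps a phrase to its concept, e.g. {"Saarbrücken": "location"}.
    It is only a stand-in for a real named entity recognizer such as GATE.
    """
    result = [dict(node) for node in nodes]  # do not modify the input nodes
    next_id = max(node["id"] for node in result) + 1
    for node in list(result):  # iterate over the original nodes only
        for phrase, concept in gazetteer.items():
            if phrase in node["content"] and node["tag"] != concept:
                remaining = node["content"].replace(phrase, " ")
                node["content"] = " ".join(remaining.split())
                result.append({"id": next_id, "docid": node["docid"],
                               "tag": concept, "content": phrase,
                               "parent": node["id"]})
                next_id += 1
    return result

nodes = [
    {"id": 1, "docid": 1, "tag": "Professor",
     "content": "Gerhard Weikum Saarbrücken", "parent": None},
    {"id": 2, "docid": 1, "tag": "Course", "content": "IR", "parent": 1},
    {"id": 3, "docid": 1, "tag": "Research", "content": "XML", "parent": 1},
]
for node in annotate(nodes, {"Saarbrücken": "location"}):
    print(node)
```

With this input, node 1 keeps "Gerhard Weikum" and a new node 4 with tag "location" and content "Saarbrücken" is added, matching the example on the slide.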


Data Model for Links


Architecture

[Figure: layered architecture with four layers: Sources, Adapters, Annotators, Search Engine. Sources: tourist guide (XML), hotel website, flight schedule, SIGIR website, e-mail, Web portal, Graupmann homepage. Adapters: XML adapter, Web adapter, e-mail adapter and further adapters. Annotators: IE processor with annotation modules for DATE, PRICE, LOCATION, … The annotated data is stored in the INDEX used by the search engine. Example annotations shown: Location=Salvador, Price=89 $, Date=15-18 August; Event=SIGIR, Location=Salvador; Location=Frankfurt, Location=Salvador, Time=13:15; Person=Schenkel, FROM=SIGIR, SUBJECT=Notification]


Outline

• Where existing search engines fail

• SphereSearch Concepts

• Transformation and Annotation

• Query Language and Scoring

• Experimental Evaluation

• Current and Future Work


SphereSearch Queries

Extended keyword queries:

• similarity conditions ~professor, ~Saarbrücken

• concept-based conditions person=Max Planck, location=Trondheim

• grouping

• join conditions

Ranked results with context-aware scoring

Note (Ralf Schenkel): Symbolic representation of a query. It is used on all slides about the query language and extended each time by the feature just introduced: at first only keywords (A B C D), then concept-based conditions (A loc=B C D), then similarity conditions (A loc~B ~C D), then grouping, then joins.


Score Aggregation: SphereScore

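Local score sL(e) for each element e (tf/idf, BM25, …)

Weighted aggregation of local scores in environment of element (sphere score):

$$ s(e) = \sum_{d=0}^{D} \alpha^{d} \sum_{e' :\, dist(e,e') = d} s_L(e'), \qquad 0 < \alpha < 1 $$

Rewards proximity of terms and compactness of term distribution

[Figure: example graph for the query terms "research" and "XML" with spheres at distances 1 and 2 around an element]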

Context awareness
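A minimal Python sketch of the sphere score, under these assumptions: the score of an element e sums the local scores of all elements within distance D, each damped by alpha raised to its distance from e; distance is the hop count in an undirected element graph; and the parameter values D=2 and alpha=0.5 are illustrative only. The function name sphere_score is hypothetical.

```python
from collections import deque

def sphere_score(graph, local, e, D=2, alpha=0.5):
    """Sum alpha**d * local score over all elements at hop distance d <= D from e.

    graph: dict mapping an element to its neighbors (undirected adjacency)
    local: dict mapping an element to its local score s_L (missing means 0)
    """
    dist = {e: 0}
    queue = deque([e])
    score = 0.0
    while queue:  # breadth-first search visits elements in order of distance
        u = queue.popleft()
        score += alpha ** dist[u] * local.get(u, 0.0)
        if dist[u] == D:
            continue  # do not expand beyond the sphere radius
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return score

# Path 1 - 2 - 3 - 4; element 4 lies outside the sphere of radius 2 around 1.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
local = {1: 1.0, 2: 2.0, 3: 4.0, 4: 8.0}
print(sphere_score(graph, local, 1))  # 1*1.0 + 0.5*2.0 + 0.25*4.0
```

An element surrounded by many closely connected matches scores higher than one whose matches are far away, which is the proximity effect described on the slide.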


Similarity Conditions

Similarity conditions like ~professor, ~Saarbrücken

Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia

Relationships quantified by statistical co-occurrence measures

[Figure: ontology graph around "professor" with the concepts lecturer, teacher, educator, scholar, academic/academician/faculty member, scientist, researcher, investigator, mentor, intellectual, artist, alchemist, wizard, director, primadonna; edges are labeled with relationship type and weight, e.g. HYPONYM (0.7)]

Query expansion

Disambiguation

δ-exp(x) = {w | sim(x,w) > δ}

Local score: weighted max over all expansion terms

$$ s_L(e, \sim professor) = \max_{t \in \delta\text{-}exp(professor)} sim(professor, t) \cdot s_L(e, t) $$

Abstraction awareness
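The expansion and the weighted maximum can be sketched in Python as follows. The similarity values are invented for illustration, the function names are hypothetical, and the sketch assumes that the query term itself always belongs to its expansion with similarity 1.

```python
def expand(term, sim, delta):
    """delta-exp(term): all words w with sim(term, w) > delta, plus the term itself."""
    expansion = {w: s for w, s in sim.get(term, {}).items() if s > delta}
    expansion[term] = 1.0  # assumption: a term is fully similar to itself
    return expansion

def expanded_local_score(element_scores, term, sim, delta):
    """Weighted max over all expansion terms t of sim(term, t) * s_L(e, t).

    element_scores: local scores s_L(e, t) of one element e, keyed by term t
    """
    expansion = expand(term, sim, delta)
    return max(weight * element_scores.get(t, 0.0)
               for t, weight in expansion.items())

# Hypothetical similarities; real values would come from the thesaurus/ontology.
sim = {"professor": {"lecturer": 0.7, "teacher": 0.6, "wizard": 0.1}}
element_scores = {"lecturer": 0.5, "wizard": 0.9}
print(expanded_local_score(element_scores, "professor", sim, delta=0.5))
```

In this example "wizard" is dropped by the threshold despite its high local score, so the element is scored through "lecturer" with 0.7 * 0.5.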


Concept-based conditions

Concept awareness

Goal: Exploit explicit (tags) and automatic annotations in documents

Example condition: location=Trondheim (concept = location, value = Trondheim), matched by element e:

docid=1, tag="location", content="Trondheim"

Allows similarity and range queries (for annotated concepts) like location~Trondheim or 1970<date<1980, with concept-specific distance measures

sL(e,c=v) = score for concept-tag match + score for value-content match (concept-specific)


Query Groups

Group conditions that relate to the same "entity":

professor teaching IR research XML → professor T(teaching IR) R(research XML)

SphereScore computed for each group

Find compact sets with one result for each group

Goal: Related terms should occur in the same context


Scores for Query Results

Query result R: one result per query group

$$ score(R) = \beta \sum_{e_i \in R} s_i(e_i) + (1 - \beta) \cdot compactness(R) $$

compactness ~ 1/size of a minimal spanning tree

[Figure: four candidate results N1 to N4 over elements of the groups A, X and B, with growing spanning trees: C(N1) = 1/3, C(N2) = 1/4, C(N3) = 1/5, C(N4) = 1/6]

Context awareness
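A minimal Python sketch of the result score, under these assumptions: compactness is taken as exactly 1 divided by the total weight of a minimal spanning tree over the result elements, pairwise distances between result elements are given, and the weighting parameter (called beta here) and its value are illustrative. The function names are hypothetical.

```python
def mst_weight(nodes, dist):
    """Total weight of a minimal spanning tree (Prim's algorithm).

    dist maps frozenset({u, v}) to the distance between u and v.
    """
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge from the tree to a node that is not yet in the tree.
        weight, nxt = min((dist[frozenset((u, v))], v)
                          for u in in_tree for v in nodes if v not in in_tree)
        total += weight
        in_tree.add(nxt)
    return total

def compactness(nodes, dist):
    size = mst_weight(nodes, dist)
    return 1.0 / size if size > 0 else 1.0  # assumption for single-element results

def result_score(group_scores, nodes, dist, beta=0.5):
    """beta * sum of per-group scores + (1 - beta) * compactness."""
    return beta * sum(group_scores) + (1 - beta) * compactness(nodes, dist)

dist = {frozenset(("A", "X")): 1, frozenset(("X", "B")): 2, frozenset(("A", "B")): 3}
print(compactness(["A", "X", "B"], dist))  # spanning tree A-X, X-B has size 3
print(result_score([0.4, 0.6, 0.2], ["A", "X", "B"], dist))
```

The example distances are chosen so that the spanning tree has size 3, which gives the compactness 1/3 shown for the first candidate result on the slide.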


Join conditions

Goal: Connect results of different query groups

A(research, XML)

B(VLDB 2005 paper)

A.person=B.person

• Join conditions do not change the score for a node

• Join conditions create a new link with a specific weight

Dependent on database size, application:

• Precomputed

• Computed during query execution

[Figure: example graph for groups A and B with the nodes "research", "XML", "Ralf Schenkel", "2004", "2005", "R.Schenkel", "VLDB", "2005"; join links with weights 1.0 and 0.9]


Score for Join Conditions

Join condition A.T=B.S:

• For all nodes n1 with type T, n2 with type S, add edge (n1,n2) with weight sim(n1,n2)^{-1}, i.e. 1/sim(n1,n2)

• sim(n1,n2): content-based similarity

[Figure: example graph with nodes A, X, B and an added join edge; resulting compactness C(N) = 1/3]
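A minimal Python sketch of how join edges could be generated. The slides do not specify the content-based similarity, so this sketch substitutes the character-based ratio from Python's difflib; the cut-off min_sim is also an assumption, added to skip dissimilar pairs and avoid division by zero. The function name join_edges is hypothetical.

```python
from difflib import SequenceMatcher

def join_edges(nodes_t, nodes_s, min_sim=0.5):
    """Create an edge (n1, n2, weight) with weight 1/sim for similar node pairs.

    nodes_t, nodes_s: lists of (node id, content) for the two joined types.
    """
    edges = []
    for id1, content1 in nodes_t:
        for id2, content2 in nodes_s:
            # Stand-in similarity in [0, 1]; 1.0 means identical content.
            sim = SequenceMatcher(None, content1.lower(), content2.lower()).ratio()
            if sim >= min_sim:
                edges.append((id1, id2, 1.0 / sim))
    return edges

persons_a = [(1, "Ralf Schenkel"), (2, "Max Planck")]
persons_b = [(10, "R.Schenkel")]
for edge in join_edges(persons_a, persons_b):
    print(edge)
```

Identical contents yield an edge of weight 1, while less similar contents yield longer edges, so results connected by weak joins are less compact.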


Outline

• Where existing search engines fail

• SphereSearch Concepts

• Transformation and Annotation

• Query Language and Scoring

• Experimental Evaluation

• Current and Future Work


Setup for Experiments

Three corpora:

• Wikipedia

• extended Wikipedia with links to IMDB

• extended DBLP corpus with links to homepages

50 queries like

• A(actor birthday 1970<date<1980) western

• G(California,governor) M(movie)

• A(Madonna,husband) B(director) A.person=B.director

Opponent: keyword queries with standard TF/IDF-based score ("simplified Google")

No existing benchmark (INEX, TREC, …) fits


Incremental Language Levels

• SSE-basic (keywords, SphereScores)

• SSE-CV (concept-based conditions)

• SSE-QG (query groups)

• SSE-Join (join conditions)


Experimental Results on Wikipedia


Experimental Results on Wiki++ and DBLP++

• SphereScores better than local scores

• New SSE features nearly double precision


Current and Future Work

• Improve graphical user interface

• Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]

• Deep Web search through automatic portal queries

• Parameter tuning with relevance feedback

• Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)


Thank you!