
Page 1: Probabilistic answers to relational queries (PARQ)

Probabilistic answers to relational queries (PARQ)

Octavian Udrea, Yu Deng, Edward Hung, V. S. Subrahmanian

Page 2: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 3: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 4: Probabilistic answers to relational queries (PARQ)

Motivation

Query algebras do not take semantics into account when computing answers

Data is not always precise: ambiguity, insufficient information

Goal: Use probabilistic ontologies to improve query answer recall and quality

Page 5: Probabilistic answers to relational queries (PARQ)

The probabilistic solution

Compute and return answers with high probability (> p_thr)

Keep probabilities hidden from the user

Problems: How do we assign a probability to each data item? How do we choose p_thr?

Page 6: Probabilistic answers to relational queries (PARQ)

Concepts

Constraint probabilistic ontologies (CPOs): is-a graphs with edges labeled with probabilities, including conditional probabilities, and with disjoint decompositions.

Ontologies are associated with terms in a data source: attributes in a relation or XML store, propositional entities in text sources.

Page 7: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 8: Probabilistic answers to relational queries (PARQ)

Running example

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: example CPO for OrganizationUnit. The clusters {Committee, Board, Team, Department} and {Legal, Executive, Financial, Marketing} are disjoint decompositions of OrganizationUnit; Board has subclasses Judicial Board, Financial Review Board, Auditing Committee and Board of Directors; Sales Department is a subclass of Department. Edges carry probability labels (0.1, 0.5, 0.85, 0.95, ...) and conditional constraint probabilities such as 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.]

Page 9: Probabilistic answers to relational queries (PARQ)

Example: decompositions

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: the example OrganizationUnit CPO, highlighting its disjoint decompositions (marked "d" in the figure).]

Page 10: Probabilistic answers to relational queries (PARQ)

Example: probability labels

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: the example OrganizationUnit CPO, highlighting the probability labels on its edges (0.1, 0.5, 0.85, 0.95, ...).]

Page 11: Probabilistic answers to relational queries (PARQ)

Example: conditional probabilities

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: the example OrganizationUnit CPO, highlighting the conditional probabilities 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.]

Page 12: Probabilistic answers to relational queries (PARQ)

Running example: Sample queries

“Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

What type of board meeting is being discussed? Since Ed Masters is present, there is a 75% probability it is a Board of Directors meeting.

What type of financial unit is referenced? Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board.

Page 13: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 14: Probabilistic answers to relational queries (PARQ)

Technical preliminaries: POB

A POB schema is a tuple (C, =>, me, φ) where:

C is a finite set of classes;

(C, =>) is a directed acyclic graph;

me produces clusters (disjoint decompositions) for each node, e.g. me(OrganizationUnit) = {{Committee, Board, Team, Department}, {Legal, Executive, Financial, Marketing}};

φ maps each edge in (C, =>) to a positive rational number in [0,1], subject to: for every c ∈ C and every cluster L ∈ me(c), Σ_{d ∈ L} φ(d, c) ≤ 1.
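
As an illustration only (not the authors' Java implementation), here is a minimal Python sketch of how such a schema might be represented, together with the cluster-sum condition above; the class and field names are hypothetical, and the example edge probabilities are loosely based on the running-example figure.

    from dataclasses import dataclass
    from typing import Dict, List, Set, Tuple

    @dataclass
    class POBSchema:
        classes: Set[str]                          # C
        edges: Set[Tuple[str, str]]                # (d, c) stands for the edge d => c
        me: Dict[str, List[Set[str]]]              # clusters (disjoint decompositions) per class
        phi: Dict[Tuple[str, str], float]          # edge -> probability in [0, 1]

        def cluster_sums_ok(self) -> bool:
            """For every class c and every cluster L in me(c): sum_{d in L} phi(d, c) <= 1."""
            for c, clusters in self.me.items():
                for cluster in clusters:
                    if sum(self.phi.get((d, c), 0.0) for d in cluster) > 1.0 + 1e-9:
                        return False
            return True

    schema = POBSchema(
        classes={"OrganizationUnit", "Committee", "Board", "Team", "Department"},
        edges={(d, "OrganizationUnit") for d in ("Committee", "Board", "Team", "Department")},
        me={"OrganizationUnit": [{"Committee", "Board", "Team", "Department"}]},
        phi={("Committee", "OrganizationUnit"): 0.1, ("Board", "OrganizationUnit"): 0.5,
             ("Team", "OrganizationUnit"): 0.1, ("Department", "OrganizationUnit"): 0.2},
    )
    assert schema.cluster_sums_ok()   # 0.1 + 0.5 + 0.1 + 0.2 <= 1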

Page 15: Probabilistic answers to relational queries (PARQ)

Back to the example

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: the example OrganizationUnit CPO from the running example.]

Page 16: Probabilistic answers to relational queries (PARQ)

Constraint probabilities

Simple constraints involve an attribute A_i and a set of values D ⊆ dom(A_i); they are used only for entities NOT represented in the current ontology.

Nil constraint: the trivial simple constraint with D = dom(A_i).

Constraint probabilities: pairs ⟨p, γ⟩ with p in [0,1] and γ a conjunction of simple constraints.

Page 17: Probabilistic answers to relational queries (PARQ)

Labeling

Labeling should not be arbitrary: an invalid labeling may lead to time-consuming consistency algorithms and to ambiguity in interpreting query answers.

Valid labeling: no constraint refers to the entities associated with this ontology, and there is exactly one nil constraint probability on each edge.
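
A sketch of the second validity condition (exactly one nil constraint probability on each edge), using a hypothetical representation in which each edge carries a list of (p, constraint) pairs and a constraint of None plays the role of the nil constraint:

    from typing import Dict, List, Optional, Tuple

    ConstraintProb = Tuple[float, Optional[str]]       # (p, gamma); gamma = None is the nil constraint
    Labeling = Dict[Tuple[str, str], List[ConstraintProb]]

    def exactly_one_nil_per_edge(phi: Labeling) -> bool:
        """Second condition of a valid labeling: one nil constraint probability per edge."""
        return all(sum(1 for (_p, gamma) in cps if gamma is None) == 1
                   for cps in phi.values())

    edge_labels: Labeling = {
        ("Board of Directors", "Board"): [(0.5, None), (0.75, "(isPresent Ed_Masters)")],
    }
    assert exactly_one_nil_per_edge(edge_labels)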

Page 18: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 19: Probabilistic answers to relational queries (PARQ)

The CPO model

A CPO is a tuple (C, =>, me, φ) where:

C is a finite set of classes;

(C, =>) is a directed acyclic graph;

me produces clusters (disjoint decompositions) for each node;

φ is a valid labeling for (C, =>).

Note there is no condition on the probabilities... yet!

Page 20: Probabilistic answers to relational queries (PARQ)

CPO enhanced data sources

Associate CPOs with some attributes of a relation.

Associate CPOs with elements in an XML data store.

Associate CPOs with some keywords for text files.

CPO_k: at most k probabilities on each edge; CPO_1 is a POB.

Page 21: Probabilistic answers to relational queries (PARQ)

Answering queries

“Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

What type of board meeting is being discussed? Since Ed Masters is present, there is a 75% probability it is a Board of Directors meeting.

Goal: Associate probabilities with possible answers.

Page 22: Probabilistic answers to relational queries (PARQ)

Probability path

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

[Figure: the example OrganizationUnit CPO, used to illustrate a probability path along a chain of edges.]

Page 23: Probabilistic answers to relational queries (PARQ)

Probability path

If c => c1 => c2 => … => ck => d is a chain of edges (a probability path from c to d, written p_{c=>d}), a selection function f is defined on the chain: for each edge x => y on the path, f selects one constraint probability f(x, y) ∈ φ(x, y).

f(p_{c=>d}) denotes the set of constraints selected by f along the path, together with their probabilities.
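
A small illustrative sketch of what "f selects one probability on each edge" means: given the labels on each edge of a chain, every choice of one constraint probability per edge is a possible selection function (the representation and the edge labels are hypothetical, loosely based on the figure):

    from itertools import product
    from typing import List, Optional, Tuple

    ConstraintProb = Tuple[float, Optional[str]]   # (p, gamma); gamma = None is the nil constraint

    def selection_functions(chain_labels: List[List[ConstraintProb]]):
        """Yield every way of picking one constraint probability per edge of the chain."""
        yield from product(*chain_labels)

    # Chain: Board of Directors => Board => OrganizationUnit
    chain = [
        [(0.5, None), (0.75, "(isPresent Ed_Masters)")],   # labels on Board of Directors => Board
        [(0.5, None)],                                     # labels on Board => OrganizationUnit
    ]
    for f in selection_functions(chain):
        print(f)   # each f picks one constraint probability on each edge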

Page 24: Probabilistic answers to relational queries (PARQ)

CPO consistency

Given a CPO (C, =>, me, φ) and an arbitrary universe of objects O, an interpretation ε is a mapping from C to 2^O.

ε is a taxonomic model iff:

objects are assigned to each class;

objects cannot be shared between classes in the same cluster;

=> edges imply subset relations on the sets of objects assigned to each class;

if A => B is labeled with probability p, at least p percent of the objects in A are also assigned to B.
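
A literal transcription of these conditions into a checker, as a sketch only (hypothetical representation: eps maps classes to object sets, and edges maps a pair (A, B) standing for A => B to the probability label p used in the last condition):

    from typing import Dict, List, Set, Tuple

    def is_taxonomic_model(eps: Dict[str, Set[str]],
                           me: Dict[str, List[Set[str]]],
                           edges: Dict[Tuple[str, str], float]) -> bool:
        # Objects cannot be shared between classes in the same cluster
        for clusters in me.values():
            for cluster in clusters:
                members = sorted(cluster)
                for i in range(len(members)):
                    for j in range(i + 1, len(members)):
                        if eps.get(members[i], set()) & eps.get(members[j], set()):
                            return False
        # => edges imply subset relations; probability labels must be respected
        for (a, b), p in edges.items():
            ea, eb = eps.get(a, set()), eps.get(b, set())
            if not ea <= eb:
                return False
            if ea and len(ea & eb) < p * len(ea):
                return False
        return True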

Page 25: Probabilistic answers to relational queries (PARQ)

CPO consistency (cont’d)

A CPO is consistent iff it has a taxonomic probabilistic model.

Deciding whether a CPO is consistent is NP-complete (related to the weight formula satisfiability problem); a non-deterministic algorithm for consistency checking is straightforward.

Page 26: Probabilistic answers to relational queries (PARQ)

Consistency approach

Identify a subclass of CPOs for which we can check consistency efficiently.

Two parts:

Pseudoconsistency – this was done for POBs

Well-structuredness – particular to CPOs

Page 27: Probabilistic answers to relational queries (PARQ)

Pseudoconsistent CPO

A CPO (C, =>, me, φ) is pseudoconsistent iff:

no two classes in the same cluster have a common subclass;

the graph is rooted;

any two immediate distinct subclasses of a class c either have no common subclass or have a greatest common subclass different from both;

there are no cycles;

if c inherits from multiple clusters, all paths from descendants of c to the root go through c.
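
A partial sketch of just the first condition (no two classes in the same cluster share a common subclass), assuming a hypothetical children map that lists the immediate subclasses of each class; the remaining conditions (rootedness, acyclicity, the greatest-common-subclass rule, the multiple-cluster rule) would be checked similarly:

    from typing import Dict, List, Set

    def subclasses(children: Dict[str, Set[str]], c: str) -> Set[str]:
        """All proper (not necessarily immediate) subclasses of c."""
        seen: Set[str] = set()
        stack = list(children.get(c, set()))
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(children.get(d, set()))
        return seen

    def clusters_share_no_subclass(children: Dict[str, Set[str]],
                                   me: Dict[str, List[Set[str]]]) -> bool:
        """No two classes in the same cluster may have a common subclass."""
        for clusters in me.values():
            for cluster in clusters:
                members = sorted(cluster)
                for i in range(len(members)):
                    for j in range(i + 1, len(members)):
                        if subclasses(children, members[i]) & subclasses(children, members[j]):
                            return False
        return True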

Page 28: Probabilistic answers to relational queries (PARQ)

Pseudoconsistency

[Figure: the example OrganizationUnit CPO, illustrating the pseudoconsistency conditions.]

Page 29: Probabilistic answers to relational queries (PARQ)

Weight factor

Let P be a set of non-nil constraint probabilities.

If P is the empty set, wf(P) = 0.

If P = {⟨p, γ⟩}, wf(P) = p.

wf(P ∪ Q) = wf(P) + wf(Q) – wf(P) · wf(Q)

Intuitive meaning: how many objects from class A do we have to assign to class B to satisfy the constraints?
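
A minimal sketch of this weight factor computation; since the pairwise combination rule is associative and commutative, a simple fold over the set works (representation and names are hypothetical):

    from functools import reduce
    from typing import List, Tuple

    def wf(non_nil: List[Tuple[float, str]]) -> float:
        """Weight factor of a set of non-nil constraint probabilities <p, gamma>."""
        probs = [p for (p, _gamma) in non_nil]
        # wf({}) = 0; wf({<p, gamma>}) = p; wf(P u Q) = wf(P) + wf(Q) - wf(P) * wf(Q)
        return reduce(lambda a, b: a + b - a * b, probs, 0.0)

    print(wf([]))                                       # 0.0
    print(wf([(0.65, "(hasSubject marketing)")]))       # 0.65
    print(wf([(0.65, "(hasSubject marketing)"),
              (0.75, "(isPresent Ed_Masters)")]))       # 0.9125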

Page 30: Probabilistic answers to relational queries (PARQ)

More weight factors

Let (C, =>, me, φ) be a CPO and c => d an edge.

We write φ(c, d) = {⟨p_0, true⟩} ∪ φ'(c, d), separating the nil constraint probability p_0 from the remaining constraint probabilities.

We define w(c, d) = max(p_0, wf(φ'(c, d))).

Result: the conditions of a taxonomic interpretation can be satisfied by selecting at most w(c, d) · |O_d| objects from d into c.

Page 31: Probabilistic answers to relational queries (PARQ)

Well-structured CPO

Conditional constraints on edges from the same cluster must be disjoint; otherwise it is impossible to compute a weight factor for the cluster edges.

The sum of the weight factors for the edges in a cluster is ≤ 1.
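
A sketch of the second condition, reusing the wf rule and the definition w(c, d) = max(p_0, wf(φ'(c, d))) from the previous slide; the representation and the example numbers are hypothetical:

    from typing import List, Optional, Tuple

    ConstraintProb = Tuple[float, Optional[str]]   # (p, gamma); gamma = None is the nil constraint

    def w(edge_labels: List[ConstraintProb]) -> float:
        """w(c, d) = max(p0, wf(non-nil constraint probabilities on the edge c => d))."""
        p0 = next(p for (p, gamma) in edge_labels if gamma is None)
        acc = 0.0
        for p, gamma in edge_labels:
            if gamma is not None:                  # wf combination rule over the non-nil pairs
                acc = acc + p - acc * p
        return max(p0, acc)

    def cluster_weight_sum_ok(cluster_edges: List[List[ConstraintProb]]) -> bool:
        """Sum of the weight factors of the edges in one cluster must be <= 1."""
        return sum(w(labels) for labels in cluster_edges) <= 1.0 + 1e-9

    cluster = [
        [(0.5, None), (0.6, "(isPresent Ed_Masters)")],        # w = max(0.5, 0.6) = 0.6
        [(0.2, None)],                                         # w = 0.2
        [(0.1, None), (0.15, "NOT (isPresent Ed_Masters)")],   # w = max(0.1, 0.15) = 0.15
    ]
    print(cluster_weight_sum_ok(cluster))   # True: 0.6 + 0.2 + 0.15 <= 1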

Page 32: Probabilistic answers to relational queries (PARQ)

Well-structuredness

[Figure: the example OrganizationUnit CPO with an additional constraint probability 0.1, <NOT (isPresent Ed_Masters), 0.2> on an edge of the same cluster as 0.5, <(isPresent Ed_Masters), 0.75>; the two conditional constraints are disjoint, illustrating well-structuredness.]

Page 33: Probabilistic answers to relational queries (PARQ)

Consistent CPOs revisited

A pseudoconsistent and well-structured CPO is consistent.

Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation; well-structuredness accounts for the assignment of objects to subclasses.

Page 34: Probabilistic answers to relational queries (PARQ)

Consistency checking algorithm

Checking pseudoconsistency is O(n²e) and checking well-structuredness is O(n²k²), where n is the number of classes, e the number of edges, and k the order of the CPO.

The algorithm is based on topological sort and on Dijkstra and its derivatives.

Page 35: Probabilistic answers to relational queries (PARQ)

CPO enhanced algebras

CPO-enhanced algebras are formally defined for relational data sources and XML data stores: selection, projection, product, join, etc.

Ongoing work: an RDF-enhanced query algebra, directly related to RDF extraction from text.

Page 36: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 37: Probabilistic answers to relational queries (PARQ)

CPO integration: motivation

[Figure: two CPOs over OrganizationUnit – the ACME corp. CPO (with classes such as Board, Committee, Executive, Financial, Financial Review Board, Auditing Committee, Board of Directors) and the EVIL corp. CPO (with classes such as Board, Team, Department, Management, Financial, Marketing, FO Board, Board of Directors, Sales Department) – each with its own probability labels.]

Interoperation constraints: Management :=: Financial, FinancialReviewBoard :=: FO Board

Email from ACME corp. to EVIL corp.: “During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting...”

Page 38: Probabilistic answers to relational queries (PARQ)

Merging CPOs

Two scenarios:

One data source that refers to similar entities but from different application domains (example: the ACME – EVIL correspondence).

Queries across multiple data sources (example: two different CPOs associated with distinct relations during a join query).

Page 39: Probabilistic answers to relational queries (PARQ)

Interoperation constraints

Since the CPOs being merged refer to similar entities, some classes may be equivalent: equality constraints c1 :=: c2.

A further possibility is immediate subclassing constraints, but these are not really used – hardly feasible.

Page 40: Probabilistic answers to relational queries (PARQ)

The integration problem

Given two CPOs S1 = (C1, =>1, me1, φ1) and S2 = (C2, =>2, me2, φ2) and a set of interoperation constraints I, an integration witness is another CPO S = (C, =>, me, φ) that satisfies S1, S2 and I.

Page 41: Probabilistic answers to relational queries (PARQ)

Integration witness

For every class c in C1 ∪ C2, either c appears in C, or c :=: d appears in I and d ∈ C – i.e., no classes get “lost”.

Similarly, no edges are lost and no constraints are lost.

If the same constraint appears on the same edge in both CPOs, take a probability p between the two probabilities.
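
A sketch of the “no classes get lost” condition, with the interoperation equality constraints represented as a hypothetical set of pairs:

    from typing import Set, Tuple

    def no_classes_lost(c1: Set[str], c2: Set[str], c_merged: Set[str],
                        equalities: Set[Tuple[str, str]]) -> bool:
        """Every class of C1 u C2 appears in C, or is equated via I to a class that does."""
        mapped = ({a for (a, b) in equalities if b in c_merged} |
                  {b for (a, b) in equalities if a in c_merged})
        return all(c in c_merged or c in mapped for c in c1 | c2)

    print(no_classes_lost(
        {"FinancialReviewBoard", "Financial"},
        {"FO Board", "Management"},
        {"FinancialReviewBoard", "Financial"},
        {("FinancialReviewBoard", "FO Board"), ("Management", "Financial")},
    ))   # True: FO Board and Management are covered by the equality constraints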

Page 42: Probabilistic answers to relational queries (PARQ)

Integration witness

Immediate subclassing constraints add edges to S.

No cluster can be split as a result of merging.

S must be pseudoconsistent and well-structured (if it is not, it is of no use). Open problem: if it is not, how can we minimally change it so that it has these properties?

Page 43: Probabilistic answers to relational queries (PARQ)

CPOmerge algorithm

CPOmerge produces an integration witness if one exists.

O(n³) – costly. In practice it is much more efficient through caching, and because some properties are preserved if the original ontologies are pseudoconsistent and well-structured.

Page 44: Probabilistic answers to relational queries (PARQ)

Who writes the interop constraints?

Having the user write them is not feasible, so how do we infer them?

Intuitive solution: if enough neighbors are in equality constraints, then infer that the respective nodes should be equivalent. But we still need some equivalence constraints to get started – use lexical distance.

How many neighbors are “enough”?

Page 45: Probabilistic answers to relational queries (PARQ)

ICI – Simple solution

Neighbor: parent, immediate child, or sibling from the same cluster.

We define p_e(c, d) = (n_e + 2) / (n_{c,d} + 2), where n_e is the number of neighbors in equality constraints and n_{c,d} is the number of neighbors of c and d.

Why? Number of equal neighbors / total number of neighbors (including self); always < 1.

ICI algorithm: if p_e exceeds a threshold, assume c and d are equal. Start with lexical distance.
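
A sketch of the ICI decision step under the reconstruction above (the exact form of p_e and the threshold value used here are assumptions recovered from the slide text):

    def p_e(num_equal_neighbors: int, num_neighbors: int) -> float:
        """Fraction of the neighborhood of c and d already known to be equal (pair itself included)."""
        return (num_equal_neighbors + 2) / (num_neighbors + 2)

    def assume_equal(num_equal_neighbors: int, num_neighbors: int,
                     threshold: float = 0.7) -> bool:
        """One ICI step: declare c :=: d when enough of their neighborhood is already equated."""
        return p_e(num_equal_neighbors, num_neighbors) > threshold

    print(assume_equal(4, 6))    # (4 + 2) / (6 + 2) = 0.75 > 0.7 -> True
    print(assume_equal(1, 6))    # (1 + 2) / (6 + 2) = 0.375       -> False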

Page 46: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 47: Probabilistic answers to relational queries (PARQ)

Give me a CPO…

There is very little work so far on probabilistic ontologies, and nothing resembling CPOs.

How do we infer them? How do we build disjoint decompositions? How do we infer probabilities?

Page 48: Probabilistic answers to relational queries (PARQ)

Building disjoint decompositions

Take regular ontologies from the Web – many sources: daml.org, SchemaWeb, OntoBroker.

Modify CPOmerge to ignore labeling; the merge result will contain disjoint decompositions.

Equality constraints can be inferred through ICI.

Page 49: Probabilistic answers to relational queries (PARQ)

Infer probabilities – simple methods

Simple methods:

Distribute probabilities uniformly within each cluster.

Or use any distance function Dist (lexical or otherwise): for each cluster L ∈ me(c) and each d => c in L,

p_{d,c} = e^{Dist(d,c)} / Σ_{e => c} e^{Dist(e,c)}
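
A sketch of the distance-based assignment, assuming the normalized-exponential form reconstructed above; Dist and the toy distance used here are placeholders:

    import math
    from typing import Callable, Dict, Iterable

    def distance_based_labels(cluster: Iterable[str], parent: str,
                              dist: Callable[[str, str], float]) -> Dict[str, float]:
        """Assign phi(d, parent) for every d in the cluster by normalizing exp(Dist(d, parent))."""
        weights = {d: math.exp(dist(d, parent)) for d in cluster}
        total = sum(weights.values())
        return {d: weight / total for d, weight in weights.items()}

    # Purely illustrative distance: absolute difference of name lengths
    toy_dist = lambda a, b: float(abs(len(a) - len(b)))
    print(distance_based_labels(["Committee", "Board", "Team", "Department"],
                                "OrganizationUnit", toy_dist))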

Page 50: Probabilistic answers to relational queries (PARQ)

Advanced methods

Probabilistic relational models with structural uncertainty (work by Dr. Getoor et al.).

Classification approach: feature extraction determines entities of interest; create conditional probabilities on those entities.

User feedback approach: general, applicable to any of the above (ongoing work).

Page 51: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 52: Probabilistic answers to relational queries (PARQ)

Experimental setup

Java implementation; CPO-enhanced relational DB.

Movies database maintained by Dr. Wiederhold, plus IMDB data; IMDB used to estimate recall.

Classifications from the Web used to build the initial CPO.

Page 53: Probabilistic answers to relational queries (PARQ)

Consistency check & inference

[Chart: running time [s] (0–5 s) vs. ontology size [no. of classes] (45–342 classes) for consistency checking and CPO inference.]

Page 54: Probabilistic answers to relational queries (PARQ)

Recall

[Chart: recall vs. ontology size [no. of classes] (45–342 classes) for probability thresholds 0.6, 0.7, 0.8, 0.9 and for the plain relational baseline.]

Page 55: Probabilistic answers to relational queries (PARQ)

Precision

[Chart: precision vs. ontology size [no. of classes] for probability thresholds 0.6, 0.7, 0.8, 0.9 and for the plain relational baseline.]

Page 56: Probabilistic answers to relational queries (PARQ)

Answer quality

[Chart: answer quality [SQRT(Precision*Recall)] vs. ontology size [no. of classes] for probability thresholds 0.6, 0.7, 0.8, 0.9 and for the plain relational baseline.]

Page 57: Probabilistic answers to relational queries (PARQ)

Query running time

[Chart: query running time [s] vs. ontology size [no. of classes] for probability thresholds 0.6, 0.7, 0.8, 0.9 and for the plain relational baseline.]

Page 58: Probabilistic answers to relational queries (PARQ)

ICI quality

[Chart: join quality vs. epsilon (0–1.2) for ICI-based joins compared with relational join quality.]

Page 59: Probabilistic answers to relational queries (PARQ)

Bottom line

Clear improvement in query answer quality, with some time penalty, but reasonable.

Very little user intervention.

CPOs are suited to a wide variety of data sources; potentially, they can be used to convey semantics across heterogeneous data sources.

Page 60: Probabilistic answers to relational queries (PARQ)

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Page 61: Probabilistic answers to relational queries (PARQ)

Current experimental setup

DBLP data covering over 60 years of scientific publications, as an XML data set.

CPOs built from complex ontologies: the DBLP classification and the ACM classification of subjects.

Page 62: Probabilistic answers to relational queries (PARQ)

Goals (1)

Determine the efficiency of advanced CPO inference methods

Experimentally determine the best approach in terms of minimizing user feedback

Page 63: Probabilistic answers to relational queries (PARQ)

Goals (2)

Use CPOs with RDF databases: for extracting RDF from text as a means of using semantic information, and for answering queries from RDF databases.

Benefits: the probabilistic model is clearly formalized, and there is a proven improvement in answer quality.

Experimentally determine what the probability threshold may be for various domains.

Page 64: Probabilistic answers to relational queries (PARQ)

Thank you