Uncertainty in Databases Lecture 1: Introduction Faculty of Computer Science Technion – Israel Institute of Technology Spring 2015


Uncertainty in Databases

Lecture 1: Introduction

Faculty of Computer Science
Technion – Israel Institute of Technology

Spring 2015

Assumed Background

• Databases
  – Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)

• Algorithms and complexity
  – Asymptotic running time, PTIME, NP, completeness, reduction

• Basic probability theory
  – Probability space, event, random variable, conditional probability

Attendance Requirement

• 4 mandatory assignments, no exam
  – Theoretical (20%), programmatic (30%), theoretical (20%), programmatic (30%)

• To get a grade, students must submit all assignments and attend lectures
  – <=2 misses is fine, >5 misses is unacceptable
  – Exception: students who miss 3-5 lectures can get a grade by attending an easy exam on the course material

• Must pass, 10% of the grade (other grades normalized accordingly)


UNCERTAINTY IN DATABASES

Some Modern Database Content

[Figure: Integration of modern data sources: Business DBs, Social Media, Text Analytics / NLP, Sensing Data, OCR / Image, Web Pages, Financial Reports, Med Reports, Gov Reports, Knowledge Bases, Signal / Image Processing]

Knowledge Bases

• Microsoft Probase
• Google Knowledge Graph
• Google Knowledge Vault
• Stanford DeepDive
• MPI YAGO
• CMU NELL
• Freebase
• . . .

Instance  Concept   Probability
Israel    country   0.4
Israel    person    0.2
Israel    location  0.35

[Figure labels: Concept, Instance, Attribute, Value, Relationship]

Relating to Big Data

Volume, Velocity, Variety, Veracity, Value

• Missing information
• Conflicting information
• Probabilistic information

Popular Topics in DB Research

• VLDB 2014 Ten Year Best Paper
  – Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases

• PODS 2014 Keynote
  – Leonid Libkin: Incomplete data: what went wrong, and how to fix it

• SIGMOD/PODS 2014 Workshop on Big Uncertain Data
  – Kimelfeld (DB) and Kersting (AI)

• ICDT 2013 Test-of-Time Award
  – Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering

What’s in the Course?

• Principled, application-independent paradigms for managing uncertainty in data
  – Incomplete / inconsistent / probabilistic databases

• Two key aspects for every paradigm:
  – Representation
    • How do we represent what we know, what is missing, and what is our confidence?
  – Query evaluation
    • What is the meaning of query answering in the presence of uncertainty? What is the computational complexity involved?


INCOMPLETE DATABASES

Missing Information

• Problem: pieces of data missing, but we need to keep whatever partial knowledge we have

Registrations
student  course
Ahuva    PL

Courses
course  lecturer
PL      Eran

• A source tells us that Alon is a student of Keren
  – How can we represent it in our DB?

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

⊥ = NULL

SQL’s NULL

• NULL is SQL’s special “missing value”

• Same queries as complete tables, but SQL assigns a special behavior to logic over NULL
  – “Three-valued logic”: true, false, unknown

• Alas, there are some issues...
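The three-valued logic can be sketched in a few lines of Python. This is only an illustration, under the assumption that `None` stands in for SQL's "unknown"; the helper names (`and3`, `or3`, `not3`, `eq3`, `where_keeps`) are made up for this sketch and are not part of any SQL engine.

```python
# Kleene three-valued logic, with None standing in for SQL's "unknown".
# All function names here are hypothetical helpers for illustration.

def not3(a):
    return None if a is None else (not a)

def and3(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def eq3(x, y):
    # A comparison that involves NULL evaluates to unknown.
    return None if x is None or y is None else x == y

def where_keeps(condition):
    # A WHERE clause keeps a row only when its condition is true.
    return condition is True

course = None  # a NULL course
tautology = or3(eq3(course, "PL"), not3(eq3(course, "PL")))
print(tautology, where_keeps(tautology))
```

For a NULL course, the condition `course = 'PL' OR NOT (course = 'PL')` evaluates to unknown, so a WHERE clause drops the row even though the condition is a tautology in ordinary two-valued logic.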

Try It Yourself (psql)

CREATE TABLE Registrations(
  student varchar(40),
  course varchar(40));

INSERT INTO Registrations VALUES
  ('Ahuva','PL'),
  ('Alon',NULL);

CREATE TABLE Courses(
  course varchar(40),
  lecturer varchar(40));

INSERT INTO Courses VALUES
  ('PL','Eran'),
  (NULL,'Keren');

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

SELECT student, lecturer
FROM Registrations R, Courses C
WHERE R.course = C.course;

student  lecturer
Ahuva    Eran

Of course, we've lost our initial association (join)...

Try More Yourself (psql)

Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

SELECT student FROM Registrations;

student
Ahuva
Alon

SELECT student FROM Registrations
WHERE course='PL';

student
Ahuva

SELECT student FROM Registrations
WHERE course!='PL';

student
(no rows)

SELECT student FROM Registrations
WHERE course='PL' OR course!='PL';

student
Ahuva
Alon ??

Inconsistent logic... real problem!
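The queries above can be reproduced without a PostgreSQL installation. The sketch below swaps in Python's built-in `sqlite3` module for psql, which is an assumption of this example (SQLite follows the same three-valued treatment of NULL in WHERE clauses); the helper `students` is a hypothetical name.

```python
import sqlite3

# In-memory SQLite database standing in for the psql session (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registrations(student varchar(40), course varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
""")

def students(where):
    # Hypothetical helper: run one SELECT with the given WHERE condition.
    sql = "SELECT student FROM Registrations WHERE " + where + " ORDER BY student"
    return [row[0] for row in conn.execute(sql)]

eq = students("course = 'PL'")
neq = students("course != 'PL'")
either = students("course = 'PL' OR course != 'PL'")
is_null = students("course IS NULL")
print(eq, neq, either, is_null)
```

Alon is returned by neither comparison nor by their disjunction; only the explicit `IS NULL` test finds him.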

Labeled Nulls in “Naive” Tables

Registrations
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Courses
course  lecturer
PL      Eran
⊥1      Keren
⊥2      Shaul

• Just like nulls, but each null has a name
  – We do not know what the value is, but we do know that two nulls with the same name are the same

Registrations ⨝ Courses =

student  course  lecturer
Ahuva    PL      Eran
Alon     ⊥1      Keren
Ahuva    ⊥2      Shaul
?        ?       ?
?        ?       ?
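A minimal sketch of a join over naive tables, assuming labeled nulls are modeled as a small Python class (`Null`, a name invented here) whose instances are equal exactly when their labels match, and never equal to a constant:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:
    """Hypothetical model of a labeled null: equal only to a null with the same label."""
    label: int

registrations = [("Ahuva", "PL"), ("Alon", Null(1)), ("Ahuva", Null(2))]
courses = [("PL", "Eran"), (Null(1), "Keren"), (Null(2), "Shaul")]

def naive_join(regs, crs):
    # Treat labeled nulls as ordinary values: two cells join when they are
    # the same constant or the same labeled null.
    return [(s, c, l) for (s, c) in regs for (c2, l) in crs if c == c2]

joined = naive_join(registrations, courses)
for row in joined:
    print(row)
```

The join now keeps the Alon/Keren and Ahuva/Shaul associations that plain SQL NULLs lose. It returns only the three rows that join regardless of what the nulls stand for; rows that would appear only for particular values of ⊥1 or ⊥2 are not produced.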

Possible Worlds

Closed-World Assumption:

Registrations (incomplete)
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Registrations (possible world)
student  course
Ahuva    PL
Alon     PL
Ahuva    DB

Registrations (possible world)
student  course
Ahuva    PL
Alon     DB
Ahuva    DB

. . .

Open-World Assumption:

Registrations (incomplete)
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Registrations (possible world)
student  course
Ahuva    PL
Alon     PL
Ahuva    DB
Anna     AI

Registrations (possible world)
student  course
Ahuva    PL
Alon     DB
Ahuva    DB
Ahuva    AI
Avi      ML

. . .

Semantics of Query Answering

[Figure: an incomplete DB stands for a set of possible worlds; applying the query Q to each possible world yields the answers a1, a2, a3, a4, . . .]

What is the answer to Q over the incomplete DB?

• Certain answers (“weak”): ∩ai
• Represent {a1,a2,…} as an incomplete relation (“strong”)
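As a rough illustration of certain answers, the sketch below enumerates possible worlds under the closed-world assumption and intersects the query answers. The finite `domain` of course names is an assumption made so the enumeration terminates (in general the nulls range over an infinite domain), and the query "students registered to PL" is a made-up example.

```python
from itertools import product

registrations = [("Ahuva", "PL"), ("Alon", "⊥1"), ("Ahuva", "⊥2")]
nulls = ["⊥1", "⊥2"]
domain = ["PL", "DB", "AI"]  # hypothetical finite set of course names

def possible_worlds(table):
    # Closed world: a world is obtained by replacing every null with a constant.
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        yield {(s, valuation.get(c, c)) for (s, c) in table}

def query(world):
    # Q: students registered to PL (example query, not from the slides).
    return {s for (s, c) in world if c == "PL"}

answers = [query(w) for w in possible_worlds(registrations)]
certain = set.intersection(*answers)   # answers true in every world
possible = set.union(*answers)         # answers true in some world
print(certain, possible)
```

Ahuva is a certain answer because her PL registration holds in every world; Alon is only a possible answer, since he appears exactly in the worlds where ⊥1 is PL.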

Application: Data Exchange

[Figure: Mapping; Global Schema: Users, Associations, Messages, ...]

The Clio Project

IBM + U. Toronto – tool for data exchange
Commercialized in IBM DB2

Formalism [Fagin et al. 05]

A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

S:
StudLecturer(student, lecturer)

T:
Registrations(student, course)
Courses(course, lecturer)

Σ:
StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance:

TaughtBy
student  lecturer
Ahuva    Shaul
Alon     Keren

?? We don’t have z! So 2 options:
1) Abort
2) Do our best to maximize usability

Formalism [Fagin et al. 05]

A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

S:
StudLecturer(student, lecturer)

T:
Registrations(student, course)
Courses(course, lecturer)

Σ:
StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance:

TaughtBy
student  lecturer
Ahuva    Shaul
Alon     Keren

Solution:

Registrations
student  course
Ahuva    ⊥1
Alon     ⊥2

Courses
course  lecturer
⊥1      Shaul
⊥2      Keren
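One way to obtain such a solution is to invent a fresh labeled null for every source tuple, as a witness for the existentially quantified z. The sketch below does this for the single assertion in Σ; the function name `exchange` and the string encoding of nulls are assumptions of this example, not the Clio implementation.

```python
from itertools import count

# Source instance: (student, lecturer) pairs.
taught_by = [("Ahuva", "Shaul"), ("Alon", "Keren")]

def exchange(source):
    # Hypothetical helper applying
    #   StudLecturer(x,y) -> exists z. Registrations(x,z) AND Courses(z,y)
    # once per source tuple, with a fresh labeled null for each z.
    fresh = count(1)
    registrations, courses = [], []
    for student, lecturer in source:
        z = "⊥" + str(next(fresh))  # fresh labeled null witnessing z
        registrations.append((student, z))
        courses.append((z, lecturer))
    return registrations, courses

registrations, courses = exchange(taught_by)
print(registrations)
print(courses)
```

Using a distinct null per source tuple avoids claiming that Ahuva and Alon take the same course, which the source does not tell us.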

Problems Studied in Data Exchange

• Materialization
  – Many solutions exist; what makes one solution “better” than another? Is there a “best” solution? How can we find it?

• Target query answering
  – Given a source instance and a query over the target, evaluate the query (semantics / complexity)

• Manipulating schema mappings
  – Composition and inversion of mappings


INCONSISTENT DATABASES

Inconsistency

• An inconsistent database contains inconsistent (or impossible) information
  – Two students have the same ID
  – A student gets credit for the same course twice
  – A student takes a course that is not listed in the course database
  – A student has a grade for this course but a grade is missing for an assignment

• Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ

Query Answering

Database D

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Courses
course  lecturer
PL      Eran
DC      Keren

Integrity Constraints Σ

Functional Dependency: student, course → grade

SELECT student
FROM Grades G, Courses C
WHERE G.grade >= 85 AND
      G.course = C.course AND
      C.lecturer='Eran'

Ahuva
Alon ?

SELECT student
FROM Grades G, Courses C
WHERE G.grade >= 87 AND
      G.course = C.course AND
      C.lecturer='Eran'

Ahuva
Alon ✗

SELECT student
FROM Grades G, Courses C
WHERE G.grade >= 80 AND
      G.course = C.course AND
      C.lecturer='Eran'

Ahuva
Alon

Minimal Repairs [Arenas, Bertossi, Chomicki 99]:

DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”

Inconsistent database D

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Repair D'1

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86

Repair D'2

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      81
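The two repairs can be computed by brute force. The sketch below reads "differs in a minimal way" as "maximal consistent subset of D" (only deletions allowed), which is an assumption that matches this example; the definition admits other notions of minimality. The function names are invented for the sketch.

```python
from itertools import combinations

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]

def consistent(rows):
    # Checks the functional dependency student, course -> grade.
    seen = {}
    for student, course, grade in rows:
        if seen.setdefault((student, course), grade) != grade:
            return False
    return True

def repairs(rows):
    # Assumption: a repair is a maximal consistent subset of the rows.
    # Exponential brute force, fine only for tiny examples.
    subsets = [set(c) for k in range(len(rows) + 1)
               for c in combinations(rows, k) if consistent(c)]
    return [s for s in subsets if not any(s < t for t in subsets)]

for repair in repairs(grades):
    print(sorted(repair))
```

Each repair keeps Ahuva's tuple, which is involved in no violation, and exactly one of Alon's two conflicting grades.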

Semantics of Query Answering

[Figure: an inconsistent DB stands for its set of repairs (consistent DBs); applying the query Q to each repair yields the answers a1, a2, a3, a4, . . . , an]

What is the answer to Q over the inconsistent DB?

Consistent Answers: ∩ai
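Consistent answers for the Grades example can be sketched by evaluating the query on every repair and intersecting the results. The code assumes that (student, course) acts as a key of Grades, so a repair keeps exactly one tuple per key value; the function names and the `threshold` parameter are illustrative.

```python
from itertools import product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]
courses = [("PL", "Eran"), ("DC", "Keren")]

def repairs(rows):
    # For a key constraint, a repair picks one tuple from each group of
    # tuples that agree on the key (student, course).
    groups = {}
    for row in rows:
        groups.setdefault((row[0], row[1]), []).append(row)
    return [set(choice) for choice in product(*groups.values())]

def query(grade_rows, threshold):
    # Students with a grade >= threshold in a course taught by Eran.
    return {s for (s, c, g) in grade_rows for (c2, lecturer) in courses
            if g >= threshold and c == c2 and lecturer == "Eran"}

def consistent_answers(threshold):
    return set.intersection(*(query(r, threshold) for r in repairs(grades)))

for t in (85, 87, 80):
    print(t, sorted(consistent_answers(t)))
```

With threshold 85 Alon qualifies in only one of the two repairs, so he is not a consistent answer; with threshold 80 he qualifies in both.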

Algorithms / Complexity

Very recent result by Koutris & Wijsen: For consistent query answering with key constraints, Select-Project-Join (SPJ) queries without repeated relations can be classified into three categories:

1. Rewriting: Q is rewritten into a query Q' that is evaluated directly over the inconsistent DB (ignore inconsistency) and returns ∩ai
2. Graph algorithm: an algorithm ALG over the inconsistent DB returns ∩ai
3. coNP-complete (exptime under standard complexity assumptions)

Incorporating Preferences

Courses
course  lecturer
DB      Keren
DC      Keren
DC      Eran

Functional dependencies:
course → lecturer
lecturer → course

What if we trust tuple 2 more than tuple 1?

Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): 209-246 (2012)


PROBABILISTIC DATABASES

How to accommodate the probabilistic nature of data at the database & query level?

Student  University  Pr
Ahuva    Technion    1.0
Alon     Technion    0.7
Alon     HaifaU      0.3

Employee  Employer  Role  Pr
Ahuva     Intel     Eng   0.7
Ahuva     Intel     PM    0.2
Ahuva     Intel     VP    0.1
Alon      Yahoo!    Eng   0.4
Alon      Google    Eng   0.4
Alon      Intel     PM    0.1

• Find the students that are employed as engineers
  - Ahuva (0.7), Alon (0.8)

• How many students work at Intel?
  - Expectation = 1 + 0.1

• Is any PM a Technion student?
  - Yes w/ prob 1-((1-0.2)*(1-0.7*0.1))
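The three answers can be checked by enumerating possible worlds. The sketch assumes that the alternatives listed for one person in one table are mutually exclusive, that different blocks are independent, and that any leftover probability mass (0.1 for Alon's employment) means "no tuple"; the dictionary layout and names are choices made for this example.

```python
from itertools import product

# Each block lists mutually exclusive alternatives with their probabilities.
university = {
    "Ahuva": [("Technion", 1.0)],
    "Alon": [("Technion", 0.7), ("HaifaU", 0.3)],
}
employment = {
    "Ahuva": [(("Intel", "Eng"), 0.7), (("Intel", "PM"), 0.2), (("Intel", "VP"), 0.1)],
    "Alon": [(("Yahoo!", "Eng"), 0.4), (("Google", "Eng"), 0.4), (("Intel", "PM"), 0.1)],
}
people = sorted(university)

def with_absent(alternatives):
    # Assumption: leftover probability mass means the tuple is absent.
    rest = 1.0 - sum(p for _, p in alternatives)
    return alternatives + ([(None, rest)] if rest > 1e-12 else [])

def worlds():
    blocks = [with_absent(university[s]) for s in people]
    blocks += [with_absent(employment[s]) for s in people]
    for choice in product(*blocks):
        prob = 1.0
        for _, p in choice:
            prob *= p
        unis = {s: v for s, (v, _) in zip(people, choice[:len(people)])}
        jobs = {s: v for s, (v, _) in zip(people, choice[len(people):])}
        yield prob, unis, jobs

engineer = {s: 0.0 for s in people}
expected_at_intel = 0.0
pm_at_technion = 0.0
for prob, unis, jobs in worlds():
    for s in people:
        employed_student = unis[s] is not None and jobs[s] is not None
        if employed_student and jobs[s][1] == "Eng":
            engineer[s] += prob
        if employed_student and jobs[s][0] == "Intel":
            expected_at_intel += prob
    if any(unis[s] == "Technion" and jobs[s] is not None and jobs[s][1] == "PM"
           for s in people):
        pm_at_technion += prob

print(engineer, expected_at_intel, pm_at_technion)
```

Up to floating-point rounding, the enumeration agrees with the numbers above: 0.7 and 0.8 for the engineers, an expectation of 1.1, and 1-((1-0.2)*(1-0.7*0.1)) = 0.256 for the last question.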

Semantics

[Figure: a probabilistic DB represents a space of ordinary DBs with probabilities p1, p2, p3, p4, . . . , pn]

Semantics of Query Answering

[Figure: applying the query Q to each ordinary DB yields the answers a1, a2, a3, a4, . . . , an with probabilities p1, p2, p3, p4, . . . , pn; the answer A to Q over the probabilistic database is derived from this space of answers]

What is A?

• Rep of the probability space
• Mapping tuple → marginal probability

Algorithms for Query Answering

• Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
  – Queries that can be solved in polynomial time
    • By repeated decomposition into simpler queries
  – Queries for which answering is #P-hard
    • Hence, cannot be computed in polynomial time under standard complexity assumptions

• Heuristic via BDDs [Olteanu+]

• Guaranteed approximation via sampling
  – Additive approx. p ± ε is simple
  – Multiplicative approx. (1 ± ε)p requires more work
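The additive approximation can be sketched with plain Monte Carlo sampling: draw possible worlds, evaluate the query in each, and report the fraction of hits. By Hoeffding's inequality, n ≥ ln(2/δ) / (2ε²) samples give an estimate within ± ε of p with probability at least 1 − δ. The example estimates the probability that some PM is a Technion student, using the same assumed independence model as the Student/Employee tables; all function names are illustrative.

```python
import math
import random

def sample_world(rng):
    # Ahuva is a Technion student with probability 1; only her role is random.
    ahuva_role = rng.choices(["Eng", "PM", "VP"], weights=[0.7, 0.2, 0.1])[0]
    alon_uni = rng.choices(["Technion", "HaifaU"], weights=[0.7, 0.3])[0]
    # Alon: engineer (Yahoo! or Google) 0.8, PM at Intel 0.1, no job 0.1.
    alon_role = rng.choices(["Eng", "PM", None], weights=[0.8, 0.1, 0.1])[0]
    return ahuva_role, alon_uni, alon_role

def query_holds(world):
    # Boolean query: is any PM a Technion student?
    ahuva_role, alon_uni, alon_role = world
    return ahuva_role == "PM" or (alon_uni == "Technion" and alon_role == "PM")

def estimate(epsilon, delta, seed=0):
    # Hoeffding bound on the number of samples for additive error epsilon.
    n = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    rng = random.Random(seed)
    hits = sum(query_holds(sample_world(rng)) for _ in range(n))
    return hits / n, n

p_hat, n = estimate(epsilon=0.02, delta=0.01)
print(p_hat, n)
```

The sample size depends only on ε and δ, not on p. A multiplicative guarantee (1 ± ε)p is harder because the number of samples needed by this naive scheme grows as p gets small.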

Probabilistic XML

[Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation

[Figure: a probabilistic XML tree with node labels university, department, member, name (Paul, David, Amy, Emily, Nicole), position (chair, f. prof, a. prof), and ph.d. studs, and with edge probabilities 0.6, 0.5, 0.5]


PLANNED SCHEDULE

#   Date           Topic
1   24/03          Intro
2   31/03          DB Essentials
    07/04          Passover
3   12/04* (comp)  Incompleteness
4   14/04          Data Exchange
5   21/04          Inconsistent DBs
                   Assignment 1 due
6   28/04          Consistent Q Answering
7   05/05          Consistent Q Answering
8   12/05          Pref. Repairs + Misc
                   Assignment 2 due
9   19/05          Probabilistic DB
10  26/05          Query Inference
    02/06          No Lecture
                   Assignment 3 due
11  09/06          Query Inference
    16/06          Guest Lecture
12  23/06          Extras
                   Assignment 4 due