Uncertainty in Databases
Lecture 1: Introduction
Faculty of Computer Science, Technion – Israel Institute of Technology
Spring 2015
Assumed Background
• Databases
  – Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)
• Algorithms and complexity
  – Asymptotic running time, PTIME, NP, completeness, reduction
• Basic probability theory
  – Probability space, event, random variable, conditional probability
Attendance Requirement
• 4 mandatory assignments, no exam
  – Theoretical (20%), programmatic (30%), theoretical (20%), programmatic (30%)
• To get a grade, students must submit all assignments and attend lectures
  – <=2 misses is fine, >5 misses is unacceptable
  – Exception: students who miss 3-5 lectures can get a grade by attending an easy exam on the course material
    • Must pass, 10% of the grade (other grades normalized accordingly)
Some Modern Database Content
[Figure: integration of many kinds of content: business DBs, social media, text analytics / NLP, sensing data, OCR / image, web pages, financial reports, medical reports, knowledge bases, government reports, signal / image processing]
Knowledge Bases
• Microsoft Probase
• Google Knowledge Graph
• Google Knowledge Vault
• Stanford DeepDive
• MPI YAGO
• CMU NELL
• Freebase
• . . .

Instance  Concept   Probability
Israel    country   0.4
Israel    person    0.2
Israel    location  0.35

[Figure labels: Concept / Instance, Attribute / Value, Relationship / Relationship]
Relating to Big Data
Volume, Velocity, Variety, Veracity, Value
• Missing information
• Conflicting information
• Probabilistic information
Popular Topics in DB Research
• VLDB 2014 Ten Year Best Paper
  – Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases
• PODS 2014 Keynote
  – Leonid Libkin: Incomplete data: what went wrong, and how to fix it
• SIGMOD/PODS 2014 Workshop on Big Uncertain Data
  – Kimelfeld (DB) and Kersting (AI)
• ICDT 2013 Test-of-Time Award
  – Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering
What’s in the Course?
• Principled, application-independent paradigms for managing uncertainty in data
  – Incomplete / inconsistent / probabilistic databases
• Two key aspects for every paradigm:
  – Representation
    • How do we represent what we know, what is missing, and what is our confidence?
  – Query evaluation
    • What is the meaning of query answering in the presence of uncertainty? What is the involved computational complexity?
Missing Information
• Problem: pieces of data are missing, but we need to keep whatever partial knowledge we have
Registrations
student course
Ahuva PL
Courses
course lecturer
PL Eran
• A source tells us that Alon is a student of Keren
  – How can we represent it in our DB?
Registrations
student course
Ahuva PL
Alon ⊥
Courses
course lecturer
PL Eran
⊥ Keren
⊥ = NULL
SQL’s NULL
• NULL is SQL’s special “missing value”
• Same queries as complete tables, but SQL assigns a special behavior to logic over NULL
  – “Three-valued logic”: true, false, unknown
• Alas, there are some issues...
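A minimal sketch of the three-valued logic mentioned above, modelling "unknown" as Python's None. The helper names (and3, or3, not3, eq3) are illustrative assumptions and not part of SQL or of any library.

```python
from typing import Optional

Truth = Optional[bool]  # True, False, or None (standing in for "unknown")

def not3(a: Truth) -> Truth:
    return None if a is None else (not a)

def and3(a: Truth, b: Truth) -> Truth:
    if a is False or b is False:
        return False          # false dominates, even next to unknown
    if a is None or b is None:
        return None
    return True

def or3(a: Truth, b: Truth) -> Truth:
    if a is True or b is True:
        return True           # true dominates, even next to unknown
    if a is None or b is None:
        return None
    return False

def eq3(x, y) -> Truth:
    """Comparison in the style of SQL's '=': unknown if either side is NULL (None)."""
    if x is None or y is None:
        return None
    return x == y

# course = 'PL' OR course != 'PL', evaluated for a NULL course
course = None
result = or3(eq3(course, "PL"), not3(eq3(course, "PL")))
print(result)  # unknown, so a WHERE clause would not keep the row
```

A WHERE clause keeps a row only when its condition evaluates to true, so a row whose condition is unknown is dropped just like one whose condition is false.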
Try It Yourself (psql)
CREATE TABLE Registrations(student varchar(40), course varchar(40));
INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
CREATE TABLE Courses(course varchar(40), lecturer varchar(40));
INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');
Registrations
student course
Ahuva PL
Alon ⊥
Courses
course lecturer
PL Eran
⊥ Keren
SELECT student, lecturer FROM Registrations R, Courses C
WHERE R.course = C.course;
student lecturer
Ahuva Eran
Of course, we've lost our initial association (join)...
Try More Yourself (psql)
Registrations
student course
Ahuva PL
Alon ⊥
Courses
course lecturer
PL Eran
⊥ Keren
SELECT student FROM Registrations;
student
Ahuva
Alon
SELECT student FROM Registrations
WHERE course='PL';
student
Ahuva
SELECT student FROM Registrations
WHERE course!='PL';
student
(no rows)
SELECT student FROM Registrations
WHERE course='PL' OR course!='PL';
student
Ahuva
Alon?? (not returned, although the condition holds for every possible course value)
Inconsistent logic... real problem!
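The three queries above can be replayed without a PostgreSQL server. The sketch below swaps in SQLite through Python's standard sqlite3 module (an assumption; the slides use psql), which treats comparisons with NULL the same way for these queries. The helper students() is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registrations(student varchar(40), course varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
""")

def students(where: str) -> list:
    """Run SELECT student FROM Registrations WHERE <where> and return the names."""
    sql = f"SELECT student FROM Registrations WHERE {where} ORDER BY student"
    return [row[0] for row in conn.execute(sql)]

eq_pl = students("course = 'PL'")
ne_pl = students("course != 'PL'")
either = students("course = 'PL' OR course != 'PL'")
# Alon only shows up once the NULL case is asked for explicitly.
with_null_check = students("course = 'PL' OR course != 'PL' OR course IS NULL")

print(eq_pl, ne_pl, either, with_null_check)
```

Adding an explicit IS NULL test is the usual workaround, but it has to be written by hand for every condition that may touch a NULL.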
Labeled Nulls in “Naive” Tables
Registrations
student course
Ahuva PL
Alon ⊥1
Ahuva ⊥2
Courses
course lecturer
PL Eran
⊥1 Keren
⊥2 Shaul
• Just like nulls, but each null has a name
  – We do not know what the value is, but we do know that two nulls with the same name are the same
Registrations ⨝ Courses =
student course lecturer
Ahuva PL Eran
Alon ⊥1 Keren
Ahuva ⊥2 Shaul
? ? ?
? ? ?
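A minimal sketch of joining naive tables, assuming labeled nulls are modelled by a small hypothetical class Null whose equality is by label. The join treats every null as if it were an ordinary constant, so it returns the three rows listed above and none of the rows that would exist only if a null happened to equal PL.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:
    """A labeled null: the value is unknown, but equal labels denote the same value."""
    label: int

    def __repr__(self) -> str:
        return f"NULL_{self.label}"

registrations = [("Ahuva", "PL"), ("Alon", Null(1)), ("Ahuva", Null(2))]
courses = [("PL", "Eran"), (Null(1), "Keren"), (Null(2), "Shaul")]

def naive_join(regs, crs):
    """Join on course, comparing labeled nulls like constants."""
    return [(s, c, l) for (s, c) in regs for (c2, l) in crs if c == c2]

joined = naive_join(registrations, courses)
for row in joined:
    print(row)
```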
Possible Worlds
Closed-World Assumption (a possible world only replaces nulls with values):

Incomplete DB:
Registrations
student course
Ahuva PL
Alon ⊥1
Ahuva ⊥2

Possible worlds:
Registrations
student course
Ahuva PL
Alon PL
Ahuva DB

Registrations
student course
Ahuva PL
Alon DB
Ahuva DB
. . .

Open-World Assumption (a possible world may also contain additional tuples):

Incomplete DB:
Registrations
student course
Ahuva PL
Alon ⊥1
Ahuva ⊥2

Possible worlds:
Registrations
student course
Ahuva PL
Alon PL
Ahuva DB
Anna AI

Registrations
student course
Ahuva PL
Alon DB
Ahuva DB
Ahuva AI
Avi ML
. . .
Semantics of Query Answering
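[Figure: an incomplete DB represents a set of possible worlds; evaluating Q on each possible world gives answers a1, a2, a3, a4, . . . Two semantics for Q on the incomplete DB:]
• Certain answers ∩ai (“weak”)
• Represent {a1,a2,…} as an incomplete relation (“strong”)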
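A minimal sketch of the certain-answers ("weak") semantics under the closed-world assumption. It assumes a finite domain of course values (DOMAIN), so the possible worlds can be enumerated; the actual semantics ranges over all values. The strings n1 and n2 stand in for the two labeled nulls, and the query q is an illustrative choice.

```python
from itertools import product

NULLS = ("n1", "n2")      # stand-ins for the labeled nulls 1 and 2
DOMAIN = ("PL", "DB")     # assumed finite domain of course values

incomplete = [("Ahuva", "PL"), ("Alon", "n1"), ("Ahuva", "n2")]

def possible_worlds(table):
    """Closed world: replace every null by a domain value and add nothing else."""
    for values in product(DOMAIN, repeat=len(NULLS)):
        valuation = dict(zip(NULLS, values))
        yield frozenset((s, valuation.get(c, c)) for (s, c) in table)

def q(world):
    """Q: students registered to PL."""
    return {s for (s, c) in world if c == "PL"}

answers = [q(w) for w in possible_worlds(incomplete)]
certain = set.intersection(*answers)   # in every possible world
possible = set.union(*answers)         # in at least one possible world
print(certain, possible)
```

Ahuva is a certain answer because her PL registration appears in every world; Alon is only a possible answer.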
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

Source schema S:
StudLecturer(student, lecturer)

Target schema T:
Registrations(student, course)
Courses(course, lecturer)

Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ∧ Courses(z,y)

Source instance:
TaughtBy
student lecturer
Ahuva Shaul
Alon Keren

Target instance: ?? We don’t have z! So 2 options:
1) Abort
2) Do our best to max usability
Formalism [Fagin et al. 05]
Source instance:
TaughtBy
student lecturer
Ahuva Shaul
Alon Keren

Solution (target instance):
Registrations
student course
Ahuva ⊥1
Alon ⊥2

Courses
course lecturer
⊥1 Shaul
⊥2 Keren

Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ∧ Courses(z,y)
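A minimal sketch of how the solution above can be materialized, in the spirit of a chase step for the single assertion in Σ. It assumes the TaughtBy instance plays the role of the StudLecturer source relation (the slides use both names), and it models labeled nulls as strings of the form null_k. The function name chase is illustrative.

```python
from itertools import count

def chase(stud_lecturer):
    """Apply StudLecturer(x,y) -> exists z. Registrations(x,z) AND Courses(z,y).

    Every source tuple receives a fresh labeled null for the unknown course z.
    """
    fresh = count(1)
    registrations, courses = [], []
    for student, lecturer in stud_lecturer:
        z = f"null_{next(fresh)}"   # fresh labeled null, never reused
        registrations.append((student, z))
        courses.append((z, lecturer))
    return registrations, courses

source = [("Ahuva", "Shaul"), ("Alon", "Keren")]
registrations, courses = chase(source)
print(registrations)
print(courses)

# Joining the target relations on the null recovers the source pairs.
recovered = {(s, l) for (s, z) in registrations for (z2, l) in courses if z == z2}
print(recovered == set(source))
```

Using a distinct null per source tuple avoids claiming that two students share a course when the source says nothing of the kind.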
Problems Studied in Data Exchange
• Materialization
  – Many solutions exist; what makes one solution “better” than another? Is there a “best” solution? How can we find it?
• Target query answering
  – Given a source instance and a query over the target, evaluate the query (semantics / complexity)
• Manipulating schema mappings
  – Composition and inversion of mappings
Inconsistency
• An inconsistent database contains inconsistent (or impossible) information
  – Two students have the same ID
  – A student gets credit for the same course twice
  – A student takes a course that is not listed in the course database
  – A student has a grade for this course but a grade is missing for an assignment
• Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ
Query Answering

Database D:
Grades
student course grade
Ahuva PL 90
Alon PL 86
Alon PL 81

Courses
course lecturer
PL Eran
DC Keren

Integrity Constraints Σ:
Functional Dependency: student, course → grade

SELECT student FROM Grades G, Courses C
WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer='Eran'

Ahuva
Alon ? (only one of Alon's two conflicting grades is >= 85)
Query Answering
(Same database D and integrity constraints Σ as on the previous slide.)

SELECT student FROM Grades G, Courses C
WHERE G.grade >= 87 AND G.course = C.course AND C.lecturer='Eran'

Ahuva
Alon ✗ (neither of Alon's grades is >= 87)
Query Answering
(Same database D and integrity constraints Σ as on the previous slides.)

SELECT student FROM Grades G, Courses C
WHERE G.grade >= 80 AND G.course = C.course AND C.lecturer='Eran'

Ahuva
Alon (both of Alon's grades are >= 80)
Minimal Repairs [Arenas, Bertossi, Chomicki 99]:
DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”

Inconsistent database D:
Grades
student course grade
Ahuva PL 90
Alon PL 86
Alon PL 81

Repair D'1:
Grades
student course grade
Ahuva PL 90
Alon PL 86

Repair D'2:
Grades
student course grade
Ahuva PL 90
Alon PL 81
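A minimal sketch of the definition above, assuming that "minimal" means deleting as few tuples as possible in the set-inclusion sense (subset repairs), which is one common reading and matches D'1 and D'2. The brute-force enumeration is exponential and meant only for illustration; the function names are hypothetical.

```python
from itertools import combinations

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]

def consistent(db):
    """FD student, course -> grade: no two tuples share the key but differ on grade."""
    return all(not (a[:2] == b[:2] and a[2] != b[2]) for a, b in combinations(db, 2))

def subset_repairs(db):
    """Maximal consistent subsets of db (brute force over all subsets)."""
    candidates = [frozenset(c) for r in range(len(db) + 1)
                  for c in combinations(db, r) if consistent(c)]
    # Keep a candidate only if no other consistent candidate strictly contains it.
    return [c for c in candidates if not any(c < other for other in candidates)]

repairs = subset_repairs(grades)
for r in repairs:
    print(sorted(r))
```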
Semantics of Query Answering
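[Figure: an inconsistent DB gives rise to a set of repairs (consistent DBs); evaluating Q on each repair gives answers a1, a2, a3, a4, . . ., an. What should Q return on the inconsistent DB?]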
Semantics of Query Answering
[Figure: the consistent answers to Q on the inconsistent DB are ∩ai, the intersection of the answers a1, a2, a3, a4, . . ., an that Q gives on the repairs (consistent DBs).]
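A minimal sketch of consistent answers as the intersection over all repairs, applied to the Grades example with the three thresholds used earlier. It assumes the constraint acts as a key on (student, course), so a repair keeps exactly one tuple per key; the function names are illustrative.

```python
from itertools import groupby, product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]
courses = [("PL", "Eran"), ("DC", "Keren")]

def repairs(db):
    """For a key constraint, a repair keeps exactly one tuple per (student, course)."""
    key = lambda t: t[:2]
    groups = [list(g) for _, g in groupby(sorted(db, key=key), key=key)]
    return [list(choice) for choice in product(*groups)]

def query(grades_db, threshold):
    """Students with a grade >= threshold in a course taught by Eran."""
    return {s for (s, c, g) in grades_db for (c2, l) in courses
            if g >= threshold and c == c2 and l == "Eran"}

def consistent_answers(threshold):
    """Intersection of the query answers over all repairs."""
    return set.intersection(*(query(r, threshold) for r in repairs(grades)))

for threshold in (85, 87, 80):
    print(threshold, sorted(consistent_answers(threshold)))
```

Enumerating repairs is exponential in the number of conflicting groups, which is why the complexity classification of queries matters.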
Algorithms / Complexity
Very recent result by Koutris & Wijsen: For consistent query answering with key constraints, Select-Project-Join (SPJ) queries w/o repeated relations can be classified into three categories:
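1. Rewriting: Q is rewritten into a query Q' that is evaluated directly on the inconsistent DB (ignore inconsistency) and returns ∩ai
2. Graph algorithm: an algorithm ALG computes ∩ai from the inconsistent DB
3. coNP-complete (exponential time under standard complexity assumptions)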
Incorporating Preferences
Courses
course lecturer
DB Keren
DC Keren
DC Eran
Functional dependencies: course → lecturer, lecturer → course
What if we trust tuple 2 more than tuple 1?
Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): 209-246 (2012)
How to accommodate the probabilistic nature of data at the database & query level?
(Several rows for the same person are alternatives.)

Student University
Ahuva Technion
Alon Technion
Alon HaifaU

Employee Employer Role
Ahuva Intel Eng
Ahuva Intel PM
Ahuva Intel VP
Alon Yahoo! Eng
Alon Google Eng
Alon Intel PM
• Find the students that are employed as engineers
• How many students work at Intel?
• Is any PM a Technion student?
How to accommodate the probabilistic nature of data at the database & query level?
(Several rows for the same person are mutually exclusive alternatives.)

Student University Pr
Ahuva Technion 1.0
Alon Technion 0.7
Alon HaifaU 0.3

Employee Employer Role Pr
Ahuva Intel Eng 0.7
Ahuva Intel PM 0.2
Ahuva Intel VP 0.1
Alon Yahoo! Eng 0.4
Alon Google Eng 0.4
Alon Intel PM 0.1

• Find the students that are employed as engineers
  - Ahuva (0.7), Alon (0.8)
• How many students work at Intel?
  - Expectation = 1 + 0.1
• Is any PM a Technion student?
  - Yes w/ prob 1-((1-0.2)*(1-0.7*0.1))
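The three answers above can be checked by enumerating possible worlds. The sketch assumes that rows for the same person are mutually exclusive alternatives, that different persons and the two tables are independent, and that leftover probability mass (0.1 for Alon's employment) means "no row". All names in the code are illustrative.

```python
from itertools import product

# Each block lists mutually exclusive alternatives as (row, probability).
student_blocks = {
    "Ahuva": [("Technion", 1.0)],
    "Alon": [("Technion", 0.7), ("HaifaU", 0.3)],
}
employee_blocks = {
    "Ahuva": [(("Intel", "Eng"), 0.7), (("Intel", "PM"), 0.2), (("Intel", "VP"), 0.1)],
    "Alon": [(("Yahoo!", "Eng"), 0.4), (("Google", "Eng"), 0.4), (("Intel", "PM"), 0.1)],
}

def with_absent(alternatives):
    """Add an explicit 'no row' alternative carrying the leftover probability."""
    leftover = 1.0 - sum(p for _, p in alternatives)
    return alternatives + ([(None, leftover)] if leftover > 1e-9 else [])

def worlds():
    """Yield (universities, jobs, probability) for every possible world."""
    names = list(student_blocks)
    k = len(names)
    blocks = ([with_absent(student_blocks[n]) for n in names]
              + [with_absent(employee_blocks[n]) for n in names])
    for choice in product(*blocks):
        prob = 1.0
        for _, p in choice:
            prob *= p          # blocks are assumed independent
        uni = {n: v for n, (v, _) in zip(names, choice[:k]) if v is not None}
        job = {n: v for n, (v, _) in zip(names, choice[k:]) if v is not None}
        yield uni, job, prob

def probability(event):
    return sum(p for uni, job, p in worlds() if event(uni, job))

def is_student_engineer(name):
    return lambda uni, job: name in uni and name in job and job[name][1] == "Eng"

p_ahuva = probability(is_student_engineer("Ahuva"))
p_alon = probability(is_student_engineer("Alon"))
expected_intel = sum(p * sum(1 for n in uni if n in job and job[n][0] == "Intel")
                     for uni, job, p in worlds())
p_pm_technion = probability(lambda uni, job: any(
    u == "Technion" and n in job and job[n][1] == "PM" for n, u in uni.items()))

print(p_ahuva, p_alon, expected_intel, p_pm_technion)
```

Enumeration is exponential in the number of blocks, so it only serves as a reference point for the efficient algorithms discussed later.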
Semantics of Query Answering
[Figure: a probabilistic database is a representation of a probability space over ordinary DBs, with probabilities p1, p2, p3, p4, . . ., pn. Evaluating Q on each ordinary DB gives answers a1, a2, a3, a4, . . ., an with the same probabilities p1, . . ., pn. The answer A to Q on the probabilistic database is a mapping: tuple → marginal probability.]
Algorithms for Query Answering
• Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
  – Queries that can be solved in polynomial time
    • By repeated decomposition into simpler queries
  – Queries for which answering is #P-hard
    • Hence, cannot be computed in polynomial time under standard complexity assumptions
• Heuristic via BDDs [Olteanu+]
• Guaranteed approximation via sampling
  – Additive approx. p±ε is simple
  – Multiplicative approx. (1±ε)p requires more work
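A minimal sketch of the additive approximation by sampling. The sample size follows from Hoeffding's inequality: n >= ln(2/δ) / (2ε²) samples suffice for the estimate to be within ε of p with probability at least 1-δ. The toy database (three independent tuples t1, t2, t3 with probabilities 0.2, 0.7, 0.1) and the query are assumptions chosen so that the exact value equals the PM/Technion example.

```python
import math
import random

def hoeffding_samples(eps: float, delta: float) -> int:
    """Samples sufficient for |estimate - p| <= eps with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def sample_world(rng: random.Random) -> dict:
    """Draw one possible world of a toy tuple-independent DB (hypothetical data)."""
    return {"t1": rng.random() < 0.2, "t2": rng.random() < 0.7, "t3": rng.random() < 0.1}

def query_holds(world: dict) -> bool:
    """Boolean query: t1 OR (t2 AND t3)."""
    return world["t1"] or (world["t2"] and world["t3"])

def estimate(eps: float, delta: float, seed: int = 0) -> float:
    """Fraction of sampled worlds in which the query holds."""
    rng = random.Random(seed)
    n = hoeffding_samples(eps, delta)
    return sum(query_holds(sample_world(rng)) for _ in range(n)) / n

exact = 1 - (1 - 0.2) * (1 - 0.7 * 0.1)
print(hoeffding_samples(0.05, 0.01), estimate(0.05, 0.01), exact)
```

The sample size does not depend on p, which is why the additive guarantee is simple; a multiplicative guarantee needs more samples as p gets small, so plain sampling is no longer enough.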
Probabilistic XML
[Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation
[Figure: example probabilistic XML tree. Node labels: university, department, member, chair, name (Paul, David, Amy, Emily, Nicole), position (f. prof, a. prof), ph.d. studs; probabilities shown: 0.6, 0.5, 0.5]
1   24/03   Intro
2   31/03   DB Essentials
    07/04   Passover
3   12/04*  (comp) Incompleteness
4   14/04   Data Exchange
5   21/04   Inconsistent DBs
            Assignment 1 due
6   28/04   Consistent Q Answering
7   05/05   Consistent Q Answering
8   12/05   Pref. Repairs + Misc
            Assignment 2 due
9   19/05   Probabilistic DB
10  26/05   Query Inference
    02/06   No Lecture
            Assignment 3 due
11  09/06   Query Inference
    16/06   Guest Lecture
12  23/06   Extras
            Assignment 4 due