43
Trio: A System for Data, Uncertainty, and Lineage Search “stanford trio”

Trio Notes

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: Trio Notes

Trio: A System for Data, Uncertainty, and Lineage

Search “stanford trio”

Page 2: Trio Notes

2

Stanford Report: March ‘06

Page 3: Trio Notes

3

Trio

1. Data Student #123 is majoring in Econ: (123,Econ)

Major

2. Uncertainty Student #123 is majoring in Econ or CS:

(123, Econ ∥ CS) Major With confidence 60% student #456 is a CS major:

(456, CS: 0.6) Major

3. Lineage (456) HardWorker derived from:

(456, CS) Major“CS is hard” some web page

Page 4: Trio Notes

4

The Picture

Data

Uncertainty

Lineage(“sourcing”)

Page 5: Trio Notes

5

Why Uncertainty + Lineage?

Many applications seem to need both

• Information extraction systems

• Scientific and sensor data management

• Information integration

• Deduplication (data cleaning)

• Approximate query processing

Page 6: Trio Notes

6

Why Uncertainty + Lineage?

From a technical standpoint, it turns out that

lineage...1. Enables simple and consistent

representation of uncertain data

2. Correlates uncertainty in query results with uncertainty in the input data

3. Can make computation over uncertain data more efficient

Page 7: Trio Notes

7

Goal

A new kind of DBMS in which:

1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

With all the usual DBMS featuresScalable, reliable, efficient, ad-hoc declarative queries and updates, …

Trio Trio }

Page 8: Trio Notes

8

The Trio Trio

1. Data Model Simplest extension to relational model that’s

sufficiently expressive

2. Query Language Simple extension to SQL with well-defined

semantics and intuitive behavior

3. System A complete open-source DBMS that people

want to use

Page 9: Trio Notes

9

The Trio Trio

1. Data Model Uncertainty-Lineage Databases (ULDBs)

2. Query Language TriQL

3. System First prototype built on top of standard DBMS

Page 10: Trio Notes

10

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Drives(person,car) // may be uncertain

Suspects(person) = πperson(Saw ⋈ Drives)

Page 11: Trio Notes

11

Data Model: Uncertainty

An uncertain database represents a set ofpossible instances

• Amy saw either a Honda or a Toyota

• Jimmy drives a Toyota, a Mazda, or both

• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

• Hank is a suspect with confidence 0.7

Page 12: Trio Notes

12

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 13: Trio Notes

13

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

witness car

Amy { Honda, Toyota, Mazda }

=

Three possibleinstances

Page 14: Trio Notes

14

Six possibleinstances

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about presence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

Page 15: Trio Notes

15

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

Page 16: Trio Notes

16

Deficiency in Model

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈ Drives)

???

Does not correctlycapture possibleinstances in theresult

CANNOT

Page 17: Trio Notes

17

Lineage to the Rescue

Lineage (provenance): “where data came from”• Internal lineage

• External lineage

In Trio: A function λ from alternatives to other

alternatives (or external sources)

Page 18: Trio Notes

18

Example with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

Suspects = πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2)

λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

Page 19: Trio Notes

19

Uncertainty-Lineage Databases (ULDBs)

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

The ULDB model is “complete”

Page 20: Trio Notes

20

Querying ULDBs

• Simple extension to SQL

• Formal semantics, intuitive meaning

• Query uncertainty, confidences, and lineage

TriQLTriQL

Page 21: Trio Notes

21

Initial TriQL Example

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

???

λ(31) = (11,2),(21,2)λ(32,1)=(11,1),(22,1); λ(32,2)=(11,1),(22,2)

λ(33) = (11,1), 23

Page 22: Trio Notes

22

Formal Semantics

Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

operational semanticsD + ResultD + Result

Page 23: Trio Notes

23

TriQL: Querying Confidences

Built-in function: Conf()

SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car AND Conf(Saw) > 0.5 AND Conf(Drives) >

0.8

Page 24: Trio Notes

24

TriQL: Querying Lineage

Built-in join predicate: Lineage()

SELECT Saw.witness INTO AccusesHank FROM Suspects, Saw WHERE Lineage(Suspects,Saw) AND Suspects.person = ‘Hank’

Page 25: Trio Notes

25

Operational Semantics

Over conventional relational database:

For each tuple in cross-product of X1, X2, ..., Xn

1. Evaluate the predicate

2. If true, project attr-list to create result tuple

3. If INTO clause, insert into table

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

Page 26: Trio Notes

26

Operational Semantics

Over ULDB:For each tuple in cross-product of X1, X2, ..., Xn

1. Create “super tuple” T from all combinations of alternatives

2. Evaluate predicate on each alternative in T ; keep only the true ones

3. Project attr-list on each alternative to create result tuple

4. Details: ‘?’, lineage, confidences

SELECT attr-list [ INTO table ]FROM X1, X2, ..., Xn

WHERE predicate

Page 27: Trio Notes

27

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

Page 28: Trio Notes

28

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 29: Trio Notes

29

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 30: Trio Notes

30

Operational Semantics: Example

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

Page 31: Trio Notes

31

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 32: Trio Notes

32

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 33: Trio Notes

33

Operational Semantics: Example

(Cathy,Honda,Jim,Mazda)∥(Cathy,Honda,Bill,Mazda)∥(Cathy,Mazda,Jim,Mazda)∥(Cathy,Mazda,Bill,Mazda)

SELECT Drives.personFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

(Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)

Page 34: Trio Notes

34

Operational Semantics: Example

SELECT Drives.person INTO SuspectsFROM Saw, DrivesWHERE Saw.car = Drives.car

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jim, Mazda) ∥ (Bill, Mazda)

(Hank, Honda)

Suspects

Jim ∥ Bill

Hank

??λ( ) = ...λ( ) = ...

Page 35: Trio Notes

35

Confidences

Confidences supplied with base data

Trio computes confidences on query results• Default probabilistic interpretation• Can choose to plug in different arithmetic

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda)

Suspects

Jim: 0.12 ∥ Bill: 0.24

Hank: 0.6

?

?

?

0.3 0.4

0.6

ProbabilisticProbabilistic Min

Page 36: Trio Notes

36

Additional Query Constructs

• “Horizontal subqueries”Refer to tuple alternatives as a relation

• Unmerged (horizontal duplicates)

• Flatten, GroupAlts

• NoLineage, NoConf, NoMaybe

• Query-computed confidences

• Data modification statements

Page 37: Trio Notes

37

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

List suspects with conf values based on accuser credibility

Page 38: Trio Notes

38

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

Page 39: Trio Notes

39

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

Page 40: Trio Notes

40

Final Example Query

PrimeSuspect (crime#, accuser, suspect)

(1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank)

(2, Cathy, Frank) ∥ (2, Betty, Freddy)

person score

Amy 10

Betty 15

Cathy 5

Suspects

Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166

Frank: 0.25 ∥ Freddy: 0.75

Credibility

SELECT suspect, score/[sum(score)] as confFROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P)

Page 41: Trio Notes

41

Trio System: Version 1

Standard relational DBMS

Trio API and translator(Python)

Trio API and translator(Python)

Command-lineclient

Command-lineclient

TrioMetadat

a

TrioExplorer(GUI client)

TrioExplorer(GUI client)

Trio Stored

Procedures

EncodedData

TablesLineageTables

Standard SQL• “Verticalize”• Shared IDs for alternatives• Columns for confidence,“?”• One per result table• Uses unique IDs

• Table types• Schema-level lineage structure• conf()• lineage() “==>”

• DDL commands• TriQL queries• Schema browsing• Table browsing• Explore lineage• On-demand confidence computation

Page 42: Trio Notes

42

Future Features (sample)

Uncertainty• Incomplete relations• Continuous uncertainty• Correlated uncertainty

Lineage• External lineage• Update lineage

Query processing• “Top-K” by confidence

Page 43: Trio Notes

but don’t forgetthe lineage…