ERACER: A Database Approach for Statistical Inference and Data Cleaning

Chris Mayfield, Jennifer Neville, Sunil Prabhakar

Department of Computer Science, Purdue University
West Lafayette, Indiana, USA

SIGMOD 2010, Indianapolis


Problem

Real data is dirty
  • Inconsistent, incomplete, corrupted, outdated, etc.
  • Safety measures (e.g., constraints) are often not used
  • Poor decisions based on dirty data cost billions annually

Data cleaning is hard
  • Typically ad hoc, interactive, exploratory, etc.
  • Uncertain process: what to do with the “errors”?
  • Maintenance of results (e.g., lineage/provenance)
  • Consumes a large amount of data management time

(see Fan, Geerts, & Jia, VLDB 2008 tutorial)

Mayfield, Neville, & Prabhakar SIGMOD 2010 2 of 21


Example 1: Genealogy Data

5M people from Pedigree Resource File
  • Person (ind_id, birth, death)
  • Relative (ind_id, rel_id, role)

[Figure: family tree in which each person is labeled with birth (B) and death (D) years; some are known (e.g., B: 1744, D: 1804) and many are missing (B: ?, D: ?)]

Integrated from many sources, e.g.:
  • Census records
  • Immigration lists
  • Family history societies
  • Individual submissions


Example 2: Sensor Networks

2M readings of Intel Lab Data
  • Sensor (epoch, mote_id, temp, humid, light, volt)
  • Neighbor (id1, id2, distance)

  • 54 sensors, every 31 seconds, for 38 days
  • ≈ 18% obviously incorrect
  • Multiple data types

Source: http://db.csail.mit.edu/labdata/labdata.html


Insight

Correlations within tuples, e.g.:
  • Birth and death years
  • Temperature and humidity values

Correlations across tuples, e.g.:
  • Parents and children
  • Neighboring sensors

Apply statistical relational learning
  • Don’t just clean tuples in isolation (e.g., functional dependencies)
  • Propagate inferences multiple times

Input:
  • Possible tuple dependencies
  • Correlation model skeleton

Output:
  • PDFs for missing data
  • Flags for dirty data


Baseline Approach:

Bayesian networks

Exact inference (junction tree)

Bayes Net Toolbox for Matlab


Bayesian Network Formulation

Model template specifies conditional dependencies:

[Figure: model template (an Individual with Birth and Death attributes, linked to Father and Mother) unrolled into a Bayesian network over the birth (.b) and death (.d) nodes of F, M, I, S, and C1 to C3]

Conditional probability distribution (CPD) at each node:

P(I.d | I.b)        death year, given the birth year
P(I.b | M.b, F.b)   birth year, given parent birth years

Prior distribution at nodes with no parents: P(I.b)
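
Under standard Bayesian network semantics, the CPDs and priors multiply to give the joint distribution. As a sketch for one individual and both parents (restricting the network to these four variables is an illustrative assumption, not a formula taken from the slides):

```latex
P(M.b,\, F.b,\, I.b,\, I.d) \;=\; P(M.b)\, P(F.b)\, P(I.b \mid M.b, F.b)\, P(I.d \mid I.b)
```

Here M.b and F.b have no parents inside the fragment, so they contribute only their priors.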

Simplified version of Relational Bayesian Networks (see, e.g., Getoor & Taskar 2007)

Mayfield, Neville, & Prabhakar SIGMOD 2010 7 of 21

Page 8: ERACER: A Database Approach for Statistical Inference and ... · Database-centric approach for approximate inference I Statistical framework for correcting errors I E cient; no need

Maximum Likelihood Estimation

[Figure: the unrolled Bayesian network over the birth (.b) and death (.d) nodes of F, M, I, S, and C1 to C3]

1. Learn CPDs from data, e.g.:

    CREATE TABLE cpt_birth AS
    SELECT birth, death, count(*)
    FROM person
    GROUP BY birth, death;

2. Share CPDs across all nodes:

    -- P(I.d | I.b = 1750)
    SELECT death, count
    FROM cpt_birth
    WHERE birth = 1750;

3. Run inference (e.g., junction tree)
  • Construct Bayesian network
  • Bind evidence (query from DB)
  • Extract results (store in DB)
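
Steps 1 and 2 can be tried end to end with a small in-memory database. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module rather than the authors' setup; the rows are made up, and the count column is named n here (an illustrative choice) so that the normalization step is explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (ind_id INTEGER, birth INTEGER, death INTEGER)")
# Made-up rows for illustration only (not from the Pedigree Resource File).
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 1750, 1800), (2, 1750, 1800), (3, 1750, 1810), (4, 1760, 1820)],
)

# Step 1: learn the counts behind P(I.d | I.b) with a single GROUP BY.
conn.execute(
    "CREATE TABLE cpt_birth AS "
    "SELECT birth, death, count(*) AS n FROM person GROUP BY birth, death"
)


def death_given_birth(conn, birth):
    """Step 2: look up one birth year and normalize its counts to probabilities."""
    rows = conn.execute(
        "SELECT death, n FROM cpt_birth WHERE birth = ?", (birth,)
    ).fetchall()
    total = sum(n for _, n in rows)
    return {death: n / total for death, n in rows}


cpd_1750 = death_given_birth(conn, 1750)
```

Because the table is keyed only by (birth, death), the same lookup serves every node in the network, which is what sharing CPDs amounts to in this sketch.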


Challenges and Lessons Learned

Limiting model assumptions
  • Fixed CPD structure (e.g., always two parents)
  • Acyclicity constraint (can’t model sensor data)

Potentially millions of parameters
  • Becomes very inefficient
  • Floating point underflow

Not scalable to large data sets
  • DB may not fit into main memory
  • Moving data in/out of R, Matlab, etc.

Not designed for data cleaning
  • Propagates outliers/errors in original data
  • Need to look beyond the Markov blanket


ERACER Approach:

Relational dependency networks

Approximate inference algorithm

SQL-based framework

Integrated data cleaning


Relational Dependency Networks

For example, at each sensor and time epoch:

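[Figure: model template (a Sensor with Temp and Humid attributes, linked to its Neighbor sensors) unrolled into a dependency network over the temperature (.t) and humidity (.h) nodes of sensors S1 to S4]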

In contrast to Bayesian networks, RDNs:
  • approximate the full joint distribution
  • learn CPDs locally based on component models
  • allow cyclic dependencies (i.e., many-to-many)
  • use aggregation to deal with heterogeneity

(see Neville & Jensen, JMLR 2007)


Component Models

Convolution (for genealogy)
  • parent age: M_PA = P(I.b − P.b)
  • death age: M_DA = P(I.d − I.b)

    -- death age model
    SELECT hist(death - birth)
    FROM person;

[Figure: histogram of death age (0 to 100 years); expected value = 54.0, standard deviation = 29.0]
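
To see why these are called convolution models: if a parent's birth year and the parent's age at the child's birth are treated as independent (an assumption of this sketch), the child's birth-year distribution is the convolution of the two. The function name and the tiny input distributions below are hypothetical.

```python
from collections import defaultdict


def convolve(parent_birth_pdf, parent_age_pdf):
    """Distribution of (parent birth year + parent age at the child's birth)."""
    child_birth_pdf = defaultdict(float)
    for year, p_year in parent_birth_pdf.items():
        for age, p_age in parent_age_pdf.items():
            # Each (year, age) pair contributes its joint probability
            # to the resulting child birth year.
            child_birth_pdf[year + age] += p_year * p_age
    return dict(child_birth_pdf)


# Hypothetical inputs: the parent was born in 1743 or 1744, and was
# 20 or 30 years old at the child's birth (stand-in for the M_PA model).
parent_birth = {1743: 0.5, 1744: 0.5}
parent_age = {20: 0.25, 30: 0.75}
child_birth = convolve(parent_birth, parent_age)
```

The death age model can be applied the same way, with the individual's own birth distribution in place of the parent's.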

Regression (for sensors)
  • mean temperature: S.t ∼ β0 + β1 · S.h + β2 · avg(N.t) + β3 · avg(N.h)
  • mean humidity: S.h ∼ γ0 + γ1 · S.t + γ2 · avg(N.t) + γ3 · avg(N.h)


ERACER Framework

Learning (one time, offline):
  1. Extract graph structure using domain knowledge
  2. RDNs aggregate existing data to learn parameters

Inference (multiple iterations):
  3. Apply component models to every value in DB
  4. Combine predictions to deal with heterogeneity
  5. Evaluate posterior distributions for cleaning
  6. Repeat 3–5 until happy (i.e., results converge)
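
The shape of the inference loop (steps 3 to 6) can be sketched in a few lines. This is not the ERACER algorithm itself: the component model is replaced by a plain neighbor average, predictions are point values rather than distributions, and the cleaning step is omitted. All names are hypothetical.

```python
def run_inference(initial, neighbors, tol=1e-6, max_rounds=100):
    """Repeat predict-and-propagate rounds until predictions stop changing.

    Assumes at least one observed value and at least one neighbor
    for every missing node.
    """
    known = [v for v in initial.values() if v is not None]
    start = sum(known) / len(known)           # crude starting guess for missing values
    current = {k: (v if v is not None else start) for k, v in initial.items()}
    for round_no in range(1, max_rounds + 1):
        updated = {}
        for node, value in initial.items():
            if value is not None:             # observed evidence is kept as-is
                updated[node] = value
                continue
            # Stand-in component model: mean of the neighbors' current predictions.
            nbrs = neighbors[node]
            updated[node] = sum(current[n] for n in nbrs) / len(nbrs)
        change = max(abs(updated[k] - current[k]) for k in updated)
        current = updated                     # propagate all predictions at once
        if change < tol:                      # "until happy": results have converged
            break
    return current, round_no


# Hypothetical chain of four sensors; the two middle readings are missing.
initial = {"s1": 20.0, "s2": None, "s3": None, "s4": 26.0}
neighbors = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2", "s4"], "s4": ["s3"]}
result, rounds = run_inference(initial, neighbors)
```

Each round reads only the previous round's predictions, so information from s1 and s4 reaches the missing nodes over several iterations rather than in one pass.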


Step 1: Extract Graphical Model

Construct nodes:

    INSERT INTO node
    SELECT make_nid(epoch, mote_id),  -- creates simple key
           new_basis(temp), new_basis(humid),
           new_basis(light), new_basis(volt)
    FROM sensor;

basis data type:
    initial   original value, if any (e.g., humid)
    pdf       current prediction (or distribution)
    suspect   data cleaning flag (true = outlier)
    round     when pdf/suspect was last updated

Construct edges:

    INSERT INTO link
    SELECT make_nid(a.epoch, a.mote_id),
           make_nid(b.epoch, b.mote_id)
    FROM neighbor AS c  -- e.g., within 6 meters
    INNER JOIN sensor AS a ON c.id1 = a.mote_id
    INNER JOIN sensor AS b ON c.id2 = b.mote_id
    WHERE a.epoch - 30 <= b.epoch
      AND a.epoch + 30 >= b.epoch;
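
The edge query can be exercised on toy data. A minimal sketch, assuming SQLite via Python's sqlite3 module; make_nid is registered here as a hypothetical string-concatenating function (the slides only say it creates a simple key), the basis type is left out, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for make_nid: one key per (epoch, mote_id) pair.
conn.create_function("make_nid", 2, lambda epoch, mote_id: f"{epoch}:{mote_id}")
conn.executescript("""
    CREATE TABLE sensor (epoch INTEGER, mote_id INTEGER, temp REAL);
    CREATE TABLE neighbor (id1 INTEGER, id2 INTEGER, distance REAL);
    CREATE TABLE link (id1 TEXT, id2 TEXT);
    INSERT INTO sensor VALUES (100, 1, 21.0), (110, 2, 21.5), (200, 2, 22.0);
    INSERT INTO neighbor VALUES (1, 2, 4.0);
""")

# Same join as on the slide: link readings of neighboring motes whose
# epochs are at most 30 apart.
conn.execute("""
    INSERT INTO link
    SELECT make_nid(a.epoch, a.mote_id), make_nid(b.epoch, b.mote_id)
    FROM neighbor AS c
    INNER JOIN sensor AS a ON c.id1 = a.mote_id
    INNER JOIN sensor AS b ON c.id2 = b.mote_id
    WHERE a.epoch - 30 <= b.epoch AND a.epoch + 30 >= b.epoch
""")
links = conn.execute("SELECT id1, id2 FROM link").fetchall()
```

Only the reading of mote 2 at epoch 110 falls inside the window around mote 1's reading at epoch 100, so a single edge is created.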


Step 2: Learn RDN Parameters

Aggregate original data values:

    CREATE TABLE learn AS
    SELECT
        -- individual instances
        min(expect(i.t)) AS ti, min(expect(i.h)) AS hi,
        min(expect(i.l)) AS li, min(expect(i.v)) AS vi,
        -- average neighbor values
        avg(expect(n.t)) AS tn, avg(expect(n.h)) AS hn,
        avg(expect(n.l)) AS ln, avg(expect(n.v)) AS vn
    FROM node AS i
    LEFT JOIN link AS l ON i.nid = l.id1
    LEFT JOIN node AS n ON l.id2 = n.nid
    GROUP BY i.nid;

Optional: apply noise filters, sample data, etc.

Estimate applicable component models
  • Convolution: use built-in hist aggregate
  • Regression: export to R; use lm function
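
What the aggregation query computes for a single attribute can be sketched in plain Python. This is an illustration under simplifying assumptions: values are plain numbers rather than basis objects, None plays the role of NULL, and the function name is hypothetical.

```python
from statistics import mean


def learning_rows(values, links):
    """One row per node: (own value, mean of linked neighbors' known values)."""
    nbrs = {nid: [] for nid in values}
    for id1, id2 in links:
        nbrs[id1].append(id2)
    rows = {}
    for nid, own in values.items():
        # Like SQL avg(), skip missing neighbor values; like the LEFT JOIN,
        # keep nodes that have no neighbors at all.
        known = [values[n] for n in nbrs[nid] if values[n] is not None]
        rows[nid] = (own, mean(known) if known else None)
    return rows


# Made-up temperatures; node "c" is missing and node "d" has no neighbors.
values = {"a": 20.0, "b": 22.0, "c": None, "d": 30.0}
links = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("c", "b")]
rows = learning_rows(values, links)
```

Rows in which both entries are present are the ones a regression fit could use as training data.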


Steps 3–6: Approximate Inference

For each round of inference:

1. Update predictions via the erace aggregate query
  • Infers/cleans all attributes in a single function call

    SELECT erace(i, n)
    FROM node AS i
    LEFT JOIN link AS l ON i.nid = l.id1
    LEFT JOIN node AS n ON l.id2 = n.nid
    GROUP BY i;

Key design choice: grouping by tuples, not attributes

2. Store results via CREATE TABLE AS (i.e., propagation)
  • Faster than UPDATE over the entire relation (MVCC)
  • Other optimizations possible (e.g., indexes on nid’s)


erace Aggregate Function

SELECT erace(i, n)

    ti     hi     tn     hn
    ?      40%    21°    38%
                  35°    23%
                  24°    ?

For each attribute in i:
  • Select applicable model
  • Apply/combine predictions
  • Evaluate (cf. init and prev)

Data cleaning algorithm:
  • Run inference for known values, as if missing
  • Is original evidence within expected range?
  • Replace outliers with inferred distributions
  • Do not propagate suspects (rely on other data)
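
The cleaning test for a single attribute might look like the sketch below. The slides do not define "expected range" here, so the k-standard-deviation threshold is an assumption of this example, as are the function name and the use of a mean in place of a full inferred distribution.

```python
def clean_value(original, predicted_mean, predicted_std, k=3.0):
    """Return (value_to_use, suspect) for one attribute.

    The prediction is assumed to come from inference run as if the
    original value were missing.
    """
    if original is None:
        # Nothing to check: a missing value is simply filled in.
        return predicted_mean, False
    # Assumed rule: flag the original if it lies more than k standard
    # deviations from the prediction.
    suspect = abs(original - predicted_mean) > k * predicted_std
    # Suspects are replaced by the inferred value; trusted evidence is kept.
    return (predicted_mean if suspect else original), suspect


flagged = clean_value(35.0, 21.0, 1.5)   # far from what the neighbors suggest
kept = clean_value(21.5, 21.0, 1.5)      # consistent with the prediction
filled = clean_value(None, 21.0, 1.5)    # missing value
```

In a full round, a value flagged as suspect would also be withheld from its neighbors' predictions, so that an outlier does not spread.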

Many more details in the paper!


Experiments:

Generate synthetic populations

Randomly set attributes to NULL

Compare inferred values to original
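
The masking and scoring steps of this protocol can be sketched as follows. The helper names, the fixed seed, and the use of mean absolute error over point predictions are illustrative assumptions; generating the synthetic population is out of scope here.

```python
import random


def mask_values(values, fraction, seed=0):
    """Hide a random fraction of values (set them to None) and report which."""
    rng = random.Random(seed)                 # fixed seed for repeatable runs
    keys = sorted(values)
    hidden = set(rng.sample(keys, int(len(keys) * fraction)))
    masked = {k: (None if k in hidden else v) for k, v in values.items()}
    return masked, hidden


def mean_absolute_error(truth, inferred, hidden):
    """Average |original - inferred| over the values that were hidden."""
    return sum(abs(truth[k] - inferred[k]) for k in hidden) / len(hidden)


# Made-up ground truth: ten nodes with known values.
truth = {i: float(i) for i in range(10)}
masked, hidden = mask_values(truth, 0.3)
```

Any inference procedure can then be run on masked and scored against truth on the hidden keys only.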


Genealogy Data Results

Accuracy of birth pdfs:
[Figure: Mean Absolute Error vs. Percent Missing (per blanket) and vs. Percent Corrupt (per blanket), BayesNet compared with ERACER]

Variance (uncertainty):
[Figure: Average Standard Deviation vs. Percent Missing (per blanket) and vs. Percent Corrupt (per blanket), BayesNet compared with ERACER]


Sensor Data Results

Accuracy of temperature pdfs:
[Figure: Mean Absolute Error vs. Percent Missing and vs. Percent Corrupt, Baseline compared with ERACER]

Accuracy of humidity pdfs:
[Figure: Mean Absolute Error vs. Percent Missing and vs. Percent Corrupt, Baseline compared with ERACER]


Summary

Database-centric approach for approximate inference
  • Statistical framework for correcting errors
  • Efficient; no need to move data to/from R/Matlab

Synergy of imputation and data cleaning tasks
  • Additional evidence identifies errors more accurately
  • Corrected data values improve the quality of inference

Empirical evaluation on two real-world data sets
  • Similar accuracy to Bayesian network baseline
  • Significant gains in runtime performance
  • Added benefit of simultaneous data cleaning
