Data Quality: Not Your Typical Database Problem

Ahmed K. Elmagarmid (IEEE Fellow and ACM Distinguished Scientist) gave the lecture "Data Quality: Not Your Typical Database Problem" in the Distinguished Lecturer Series - Leon The Mathematician.

Data Quality: Not Your Typical Database Problem

2011 © Copyright QCRI. Confidential document.

Ahmed Elmagarmid

Executive Director, Qatar Computing Research Institute

Where are we located?

Qatar Foundation

• Education
• Science & Research
• Community Development

2.8 percent of GDP to be spent on research annually by 2015

Qatar Foundation Research Division

• Qatar Biomedical Research Institute (QBRI)
• Qatar Energy & Environment Research Institute (QEERI)
• Qatar Computing Research Institute (QCRI)

QCRI Overview

QCRI Vision

To make Qatar a global center for computing research by becoming the world’s recognized leader in Arabic language technologies and in key areas vital to the global growth of Qatari business and entrepreneurial activity.

QCRI Model

Academia

• Individual projects
• Students move on
• Theoretical & basic research

National Institutions (QCRI)

• Grand practical challenges
• National and global impact
• Localized skills & knowledge
• Large teams and long term
• Example peers: INRIA, MPI

Research Parks

• Commercialization
• Entrepreneurship
• Incubation

[Figure axes: Project-based to Grand Challenges; Basic Research to Applied Research]

QCRI Ecosystem

[Figure: QCRI at the center of its partner ecosystem. Labels: QU, HKU, QEERI, QBRI, Sidra, MIT, Boeing, Aljazeera, Yahoo, Google, Microsoft, ALTIS, QSA, MEEZA, QSTP, QP, Energy Co., WikiMedia, IBM]

QCRI Research Centers

• Arabic Language Technologies
• Social Computing
• Scientific Computing
• Data Analytics
• Cloud Computing

QCRI Scientific Advisory Council

• Prof. Rich DeMillo, Georgia Tech (Chair)
• Prof. Joichi Ito, MIT Media Lab Director
• Prof. Ruzena Bajcsy, University of California – Berkeley
• Lord Rupert Redesdale, UK House of Lords
• Lew Tucker, Vice President, Cisco
• Prof. Alfred V. Aho, Columbia University
• Yousef Khalidi, Vice President, Microsoft
• Prof. Dick Lipton, Georgia Tech

The 60 Doers!

[Staff chart: team members by group: Scientific Computing, Data Analytics, Cloud Computing, Arabic Language Technologies, Social Computing, Management and Support Team]

Strategic Partnerships

5-YEAR QCRI MANPOWER PLAN

Year Manpower Change
10-11 21
11-12 34 +13
12-13 82 +48
13-14 102 +20
14-15 110 +8

This Talk

Data Quality

Enhancing the usability of the acquired data and increasing the confidence of query results

"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group

Dirty Data is Expensive

• Real-life data is often dirty: data error rates in industry range from 1% to 30% (Redman, 1998)
• In 2009 the Obama administration offered $19 billion in grants for health IT, i.e. to improve EMRs
• Erroneously priced data in retail databases costs US customers $2.5 billion each year
• The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year (2002)

Where to start? Data Quality everywhere!

• Data entry
• Information extraction
• Integration from multiple sources
• Standardization and transformation
• Business rules compliance

“Academic” Data Cleaning

● Pick a well understood data problem under some scoping assumptions and solve it independently
– Duplicates
– Functional dependency violations
– Matching dependency violations
– Missing value imputation

● Piece-meal approach to tackle the complexity and sometimes the intractability of the problem
– Repairing violations of FD constraints in special cases (no deletion, left-hand-side changes only, allowing variables, etc.)
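To make one of these "well understood" problems concrete, the following is a minimal sketch of detecting functional dependency violations. It is not a method from the talk: the function name `fd_violations`, the dict-based rows, and the sample `zip -> city` dependency with its three rows are all assumptions invented for illustration.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that violate the functional dependency lhs -> rhs.

    rows: list of dicts; lhs and rhs: tuples of column names (hypothetical schema).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        # An FD is violated when one left-hand-side value maps to
        # more than one distinct right-hand-side value.
        rhs_values = {tuple(r[c] for c in rhs) for r in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations

# Invented sample data: two rows share a ZIP but disagree on the city.
rows = [
    {"id": 1, "zip": "51519", "city": "Avon"},
    {"id": 2, "zip": "51519", "city": "Avan"},
    {"id": 3, "zip": "30528", "city": "Bree"},
]
bad = fd_violations(rows, lhs=("zip",), rhs=("city",))
```

Detecting the conflict is the easy part; deciding which of the conflicting cells to change is where the scoping assumptions listed above come in.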

“Academic” Data Cleaning

• Despite their theoretic and algorithmic beauty, rarely used
– Problems never exist in isolation
– Fixes to one problem often introduce “other” problems
– Data usually not accessible to mess with
– Integrity constraints!... What integrity constraints?!!

“Practitioner” Data Cleaning

• Will share some scary stories
– “Post-it notes” as an expert messaging system
– “Written permission” to change the value of a record
– Default values and best practices
– “Call John.. He will know what to do”

This Talk

● A few data quality challenges and (hopefully) research directions
● Summary of recent efforts at QCRI

10 Data Quality Issues

Issue 1: The data trio

[Figure: the data trio. Recoverable labels: DATA, Quality]

Extraction remains a key source of data errors

Acquiring the semantics/schema of the underlying unstructured data sources (documents, emails, related Web info, click traces, profiles, interests, etc.)

Integration aggravates the problem

Linked data as an attempt to live with errors: link as you go

Issue 2: Data level or application level

• Cleaning data tables by trusting the schema is rarely useful
• Will share a story
– Bellcore with 1800 inter-linked databases
– Rule-based logic for sanity checking
– Post-it messages to communicate between data quality officers, who work in shifts!
– A data cleaning action is meaningless if not tied to business logic or to a process. Should never be against FDs

Issue 3: Protect your gain: DQ Dashboard

● How to protect against going backwards
● How to protect your gains during the cleansing process
● Metrics:
– Minimality principle: most widely used in academic cleaning
– Value of information: to spot the most important problem to fix
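The two concerns on this slide, not going backwards and preferring minimal changes, can be combined in a small check. The sketch below is illustrative only: the rule `zip -> city`, the helper names `count_violations`, `changed_cells` and `pick_minimal_repair`, and the sample rows are assumptions, not part of the talk. It reads the minimality principle in its usual sense of preferring the repair that changes the fewest cells.

```python
from collections import defaultdict

def count_violations(rows):
    """Count ZIP values that map to more than one city (assumed rule: zip -> city)."""
    cities = defaultdict(set)
    for row in rows:
        cities[row["zip"]].add(row["city"])
    return sum(1 for values in cities.values() if len(values) > 1)

def changed_cells(before, after):
    """Count cells that differ between two tables with the same rows and columns."""
    return sum(
        1
        for old, new in zip(before, after)
        for col in old
        if old[col] != new[col]
    )

def pick_minimal_repair(original, candidates):
    """Among candidates that reduce violations, return the one changing the fewest cells."""
    baseline = count_violations(original)
    improving = [c for c in candidates if count_violations(c) < baseline]
    if not improving:
        return None  # no candidate makes progress, so keep the current data
    return min(improving, key=lambda c: changed_cells(original, c))

# Invented sample data and candidate repairs.
original = [
    {"zip": "51519", "city": "Avon"},
    {"zip": "51519", "city": "Avan"},
    {"zip": "51519", "city": "Avon"},
]
fix_one = [dict(r, city="Avon") for r in original]  # changes 1 cell
fix_two = [dict(r, city="Avan") for r in original]  # changes 2 cells
no_gain = [original[0], {"zip": "51519", "city": "Avin"}, original[2]]  # still violates
best = pick_minimal_repair(original, [fix_two, no_gain, fix_one])
```

A dashboard could track `count_violations` over time and reject any batch of changes for which the count does not drop.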

Issue 3: Protect your gain - Ideas

• Root-cause analysis for data cleaning
• Chase problems to the source to reason about “progress”
• Leveraging “provenance” to design progress meters

Issue 4: Data is not an orphan!

● Data stewards are not imaginary characters! Important data has stewards and custodians
● Need to go through these guardians first
– Some health care settings require a signed form per changed cell stating the reasons for the change
● Possible approaches:
– How to avoid stewards?
– How to integrate them in the process or minimize their involvement?

Issue 5: How clean is clean?

• Quality awareness eats up 10% of the budget [Telecom Experience]
• How to avoid over-cleaning
• Example: “Bill Forgiveness”, a real-life experience: roaming charges and cross-carrier calls have a very complicated business model
• Possible approaches
– Measure cleaning progress
– Clean only to satisfy some application needs

Issue 6: Online cleaning is a necessity, not a feature

● We live in a complex world → complex applications with 100s and 1000s of components and parameters
● Clean as you go .. Clean on demand .. Clean opportunistically .. This can be the only hope
● New concepts:
– Iterative cleaning
– Cleaning dynamic and evolving data
● Off-line cleaning can still benefit historical data but is becoming less and less important

Issue 7: Application quality

• Data Quality → Information Quality → Application Quality
• Realize the levels of complexity in current BI apps
• Data usage should influence data cleaning
– “Usage-based” data cleaning

Issue 8: SW engineering DQ

• Current focus is on discrete values with simple integrity constraints (FD, uniqueness…)
• We are good at checking if data complies with rules
• Real business rules are often “assertions” and expressed in “Turing-complete” languages
• Checking “did we write the assertions right?” becomes a lot harder
• But also.. need to think about whether we wrote the right assertions!
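The contrast between declarative constraints and assertions written in a general-purpose language can be illustrated as follows. This is a hedged sketch: both rules, the field names (`order_day`, `ship_day`, `discount_pct`, `approved_by`) and the sample orders are hypothetical and are not taken from the talk.

```python
def rule_discount_needs_approval(order):
    """Hypothetical business rule: discounts above 20% need a named approver."""
    if order["discount_pct"] > 20:
        return bool(order.get("approved_by"))
    return True

def rule_ship_after_order(order):
    """Hypothetical business rule: an order cannot ship before it is placed."""
    return order["ship_day"] >= order["order_day"]

RULES = [rule_discount_needs_approval, rule_ship_after_order]

def check(records, rules=RULES):
    """Return (record index, rule name) for every rule a record fails."""
    failures = []
    for i, rec in enumerate(records):
        for rule in rules:
            if not rule(rec):
                failures.append((i, rule.__name__))
    return failures

# Invented sample orders.
orders = [
    {"order_day": 3, "ship_day": 5, "discount_pct": 10},
    {"order_day": 7, "ship_day": 6, "discount_pct": 0},
    {"order_day": 2, "ship_day": 2, "discount_pct": 30},
    {"order_day": 4, "ship_day": 9, "discount_pct": 25, "approved_by": "John"},
]
failures = check(orders)
```

Because each rule is arbitrary code rather than a declarative constraint, the rules themselves need testing, which is the "did we write the assertions right?" problem; whether a 20% threshold is the correct policy at all is the "right assertions" problem.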

Issue 9: DQ Theory?

• The ACID properties in transaction management were not only sensible requirements but also came with algorithms and methods to enforce them during transaction processing
• Does it make sense to do the same for quality? Plausible properties along with actions for maintaining acceptable quality during data manipulation
• Some of these already exist (timeliness, currency, consistency, etc.) but lack methods of enforcement

Issue 10: Scale .. Scale

• Terabytes and petabytes of data require new ways to enforce data quality
• Which ball to drop
• Leveraging application semantics and data usage
• Sampling to learn from the few and apply to the masses
• Active learning to replace human feedback (GDR as a solution)
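One simple reading of "learn from the few" is to estimate the error rate of a large table from a random sample instead of scanning everything. The sketch below assumes a caller-supplied `is_dirty` predicate and a hypothetical `estimate_error_rate` helper; the records are invented, and the standard error uses the usual binomial approximation.

```python
import math
import random

def estimate_error_rate(records, is_dirty, sample_size, seed=0):
    """Estimate the fraction of dirty records from a simple random sample.

    Returns (estimated rate, standard error under a binomial approximation).
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    n = len(sample)
    p = sum(1 for r in sample if is_dirty(r)) / n
    stderr = math.sqrt(p * (1 - p) / n)
    return p, stderr

# Invented data: 100 of 1000 records have a missing ZIP.
records = [{"zip": "51519"}] * 900 + [{"zip": ""}] * 100

def is_missing_zip(record):
    return record["zip"] == ""

rate, stderr = estimate_error_rate(records, is_missing_zip, sample_size=200)
```

The estimate tells a team whether a full cleaning pass is worth its cost before committing to it at terabyte scale.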

Sample QCRI Projects

GDR – Guided Data Repair

• Scalable ways to involve experts
• Repurposing destructive automatic techniques to guide repairs
• Value of Information measures to generate the most important questions
• Judicious use of active learning from user feedback

GDR Architecture

[Figure: GDR architecture. Labels: Input Database Instance; Detect Errors and Violations; Learn and Repair Database; Clean Database Instance; User Query; Results]
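The idea of asking the expert the most valuable questions first can be sketched as a ranking over candidate repairs. This is not GDR's actual value-of-information measure, which the slides do not spell out: the score below (confidence times violations fixed) is a stand-in, and the `CandidateRepair` fields, the `next_questions` helper and the sample candidates are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class CandidateRepair:
    cell: tuple            # (row id, column name), hypothetical identifiers
    new_value: str
    confidence: float      # assumed estimate in [0, 1] that the update is correct
    violations_fixed: int  # violations that would disappear if it is applied

def expected_benefit(candidate):
    """Stand-in score: violations we expect to resolve by confirming this repair."""
    return candidate.confidence * candidate.violations_fixed

def next_questions(candidates, budget):
    """Return the `budget` candidates the expert should be asked about first."""
    return sorted(candidates, key=expected_benefit, reverse=True)[:budget]

# Invented candidates.
candidates = [
    CandidateRepair(("t2", "zip"), "51519", confidence=0.8, violations_fixed=1),
    CandidateRepair(("t5", "name"), "Green", confidence=0.6, violations_fixed=2),
    CandidateRepair(("t6", "zip"), "51518", confidence=0.1, violations_fixed=1),
]
ask_first = next_questions(candidates, budget=2)
```

In an active-learning setting, the expert's answers would then be used to update the confidence estimates, so fewer questions are needed over time.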

Probabilistic Data Cleaning

[Figure: probabilistic data cleaning pipeline. Labels: Input Database Instance; Uncertain Error Detection; Possible Repair Generation; Possible Clean Instance; Clean Database Instance; User Query; Probabilistic Results]

Possible Repairs

A possible repair is a clustering of the input tuples

Person

ID Name ZIP Income
P1 Green 51519 30k
P2 Green 51518 32k
P3 Peter 30528 40k
P4 Peter 30528 40k
P5 Gree 51519 55k
P6 Chuck 51519 30k

Uncertain Clustering: Possible Repairs

X1: {P1}, {P2}, {P3,P4}, {P5}, {P6}
X2: {P1,P2}, {P3,P4}, {P5}, {P6}
X3: {P1,P2,P5}, {P3,P4}, {P6}
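The Person table and its three possible repairs can be written down directly as data. The sketch below only encodes what the slide shows; the helper names `is_partition` and `same_entity` are assumptions, and no repair probabilities are assigned because the slide gives none.

```python
# Person table from the slide: ID -> (Name, ZIP, Income)
PERSON = {
    "P1": ("Green", "51519", "30k"),
    "P2": ("Green", "51518", "32k"),
    "P3": ("Peter", "30528", "40k"),
    "P4": ("Peter", "30528", "40k"),
    "P5": ("Gree", "51519", "55k"),
    "P6": ("Chuck", "51519", "30k"),
}

# Each possible repair is a clustering of the input tuples.
REPAIRS = {
    "X1": [{"P1"}, {"P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X2": [{"P1", "P2"}, {"P3", "P4"}, {"P5"}, {"P6"}],
    "X3": [{"P1", "P2", "P5"}, {"P3", "P4"}, {"P6"}],
}

def is_partition(clusters, ids):
    """A valid repair covers every tuple exactly once."""
    seen = [t for cluster in clusters for t in cluster]
    return len(seen) == len(set(seen)) and set(seen) == set(ids)

def same_entity(clusters, a, b):
    """True if tuples a and b are treated as duplicates under this repair."""
    return any(a in cluster and b in cluster for cluster in clusters)

# The answer to "how many distinct persons?" depends on the repair chosen.
entity_counts = {name: len(clusters) for name, clusters in REPAIRS.items()}
```

A query such as "how many distinct persons are there?" therefore has a different answer under each repair, which is why the results are reported probabilistically rather than as a single value.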

Thank You

www.qcri.qa
