In collaboration with Werner Nutt

Free University of Bozen-Bolzano

Data Quality

Simon Razniewski

10.8.2011 - EURAC, Data Quality

Introduction

• Simon Razniewski
  PhD student at the FUB
  – Data quality
  – Data completeness

• Werner Nutt
  Professor of Computer Science at the FUB
  Focus in research and teaching:
  – Data management, data modelling
  – Data integration
  – Incomplete information

Why data quality?

• Data are the basis for (scientific) conclusions about the world

• Conclusions are only as good as the data they are based on

• Low-quality data => low-quality conclusions

Some effects of erroneous data are funny

Man invited for pre-natal check


Some data errors are long-living

Spinach contains a lot of iron

100 g of spinach contain 35 mg of iron

Gustav v. Bunge, 1890

100 g of spinach contain only 3.5 mg of iron

Some data errors are mysterious

Student records in Georgia (USA), 2009

19,000 students leave their school to transfer to another

… but arrive nowhere

Overview

• What are data used for?
  – Data model the real world

• What can go wrong?
  – Wrong, outdated, missing data

• What can one do for
  – Correctness
  – Currency
  – Completeness
  of data?

Data model the real world

We analyze the data (instead of the real world) and draw (scientific) conclusions

=> The data determine our conclusions

Real world: students, teachers, classes

Database: tables

[Figure: class 2A of the HOB Bozen with the students Paul, Anna, Maria and Diego]

Questions about students

• “How many students are there in class 2A of the HOB Bozen?”

• “What is the average age of the students of this class?”

• “How many students play an instrument?”

Table “Students”

Name   Date of birth  School      Class
Paul   7.4.1995       HOB Bolzen  2A
Anna   3.8.1959       HOB Bozen   1A
Diego  ?              HOB Meran   2A

What is the average age of the students of class 2A of the HOB Bozen?

Many things can go wrong

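A minimal sketch of how the errors in the “Students” table affect the question about class 2A of the HOB Bozen. It loads the three rows exactly as shown on the slide into an in-memory SQLite table and runs the query. The ISO date format, the column names and the reference year 2011 (the year of the talk) are assumptions made for this illustration.

```python
import sqlite3

# The "Students" table as shown on the slide, errors included.
ROWS = [
    ("Paul", "1995-04-07", "HOB Bolzen", "2A"),  # typo in the school name
    ("Anna", "1959-08-03", "HOB Bozen", "1A"),   # implausible year, outdated class
    ("Diego", None, "HOB Meran", "2A"),          # missing date of birth, wrong school
]


def class_2a_stats(rows, reference_year=2011):
    """Return (number of students, average age) for class 2A of the HOB Bozen."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE students (name TEXT, date_of_birth TEXT, school TEXT, class TEXT)"
    )
    con.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT count(*),
               avg(? - CAST(substr(date_of_birth, 1, 4) AS INTEGER))
        FROM students
        WHERE school = 'HOB Bozen' AND class = '2A'
        """,
        (reference_year,),
    ).fetchone()
    con.close()
    return result


print(class_2a_stats(ROWS))  # (0, None)
```

No row matches the query, so the database reports zero students and no average age, although the deck's picture of the real world places Paul, Anna, Maria and Diego in that class.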

Typos

• Date of birth of Anna (3.8.1959)

• School of Paul (“HOB Bolzen”)

Factual errors

• School of Diego (“HOB Meran” instead of “HOB Bozen”)

Outdated entries

• Class of Anna (“1A” instead of “2A”)

Missing values

• Date of birth of Diego (“null value”)

Missing records

• The record about Maria (12.10.1995, HOB Bozen, 2A) is missing

Missing concepts

• The table offers no way to store information about musical instruments: an “Instrument” column (e.g. Cello, ?, ?) does not exist

What can be done?

Data quality has several dimensions. The most important ones are:

• Correctness: Does the data match the real world?

• Timeliness: Is the data up to date?

• Completeness: Are all aspects of the domain of interest captured?

Further: comprehensibility, accessibility, …

Dimension 1: Correctness

IT techniques:

1. Detecting typos or statistical outliers
   Example: students born in 1959

2. Recognizing duplicates
   Example: Mohammad Al Zaïn = Muhamad Alzain

3. Rules for logical consistency
   Example: no student can attend two schools at the same time

Organisation:

Special treatment of core data: master data management
For example: students, teachers, schools
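The three IT techniques could be sketched roughly as follows. This is only an illustration under stated assumptions: the age bounds, the reference year 2011, the similarity threshold of 0.85 and all function names are made up for this example, and string similarity from Python's `difflib` stands in for whatever duplicate detection a real system would use.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def implausible_birth_years(students, reference_year=2011, min_age=5, max_age=25):
    """1. Outliers: names of students whose age lies outside [min_age, max_age]."""
    return [
        name
        for name, birth_year in students
        if not min_age <= reference_year - birth_year <= max_age
    ]


def normalize(name):
    """Strip diacritics, spaces and punctuation, then lower-case."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


def probable_duplicate(a, b, threshold=0.85):
    """2. Duplicates: compare normalized names by string similarity."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def students_at_multiple_schools(enrolments):
    """3. Consistency rule: students enrolled at more than one school."""
    schools = defaultdict(set)
    for student, school in enrolments:
        schools[student].add(school)
    return sorted(s for s, where in schools.items() if len(where) > 1)


print(implausible_birth_years([("Paul", 1995), ("Anna", 1959)]))
print(probable_duplicate("Mohammad Al Zaïn", "Muhamad Alzain"))
print(students_at_multiple_schools(
    [("Diego", "HOB Meran"), ("Diego", "HOB Bozen"), ("Paul", "HOB Bozen")]
))
```

Each check only flags candidates. Whether a flagged record is really wrong still has to be decided against the real world.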

Dimension 2: Timeliness

• By workflow organisation: bind workflows to the IT system => timeliness is guaranteed
  Example: an enrolment is only valid if it is recorded in the database

• Through data about the currency of the data (metadata) => timeliness can be estimated
  Example: “All dropouts until the 31st of March are recorded”
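A currency statement such as the dropout example can be checked mechanically. The sketch below assumes the metadata is stored as a single date and that a question concerns all events up to some end date. The year 2011 and the function name are assumptions, since the slide gives neither.

```python
from datetime import date

# Metadata from the slide: "All dropouts until the 31st of March are recorded".
# The year 2011 is an assumption; the slide does not state one.
DROPOUTS_RECORDED_UNTIL = date(2011, 3, 31)


def dropouts_current_for(period_end, recorded_until=DROPOUTS_RECORDED_UNTIL):
    """True if all dropouts up to period_end are guaranteed to be recorded."""
    return period_end <= recorded_until


print(dropouts_current_for(date(2011, 2, 28)))  # True
print(dropouts_current_for(date(2011, 6, 30)))  # False
```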

Dimension 3/1: Completeness of values

• Can be enforced by the IT system
  Risk: nonsensical entries

• Alternative solution: enforce input of fewer values, but record reasons for missing values
  E.g. “Not applicable” or “Unknown”
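One possible way to record reasons for missing values is to let a field carry either a value or an explicit reason, but never neither. The class and attribute names below are hypothetical and serve only as a sketch of the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MissingReason(Enum):
    NOT_APPLICABLE = "Not applicable"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class Field:
    value: Optional[str] = None
    missing_reason: Optional[MissingReason] = None

    def __post_init__(self):
        # Exactly one of value / missing_reason must be given.
        if (self.value is None) == (self.missing_reason is None):
            raise ValueError("give either a value or a reason why it is missing")


instrument_of_paul = Field(value="Cello")
birth_date_of_diego = Field(missing_reason=MissingReason.UNKNOWN)
print(birth_date_of_diego.missing_reason.value)  # Unknown
```

Users are then not forced to invent a value, and later analyses can distinguish “does not apply” from “we do not know”.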

Dimension 3/3: Conceptual completeness

• Solid design is important, but not everything can be foreseen

• Flexible IT: schema changes if necessary
  – Space for comments, additional information

• Otherwise: other fields will be abused

Example: gasworks in the USA

Warnings about dogs for meter readers are entered in the address field

… later, bills are sent to that address

Address: Mountain Road 102 (Beware of dog)

Dimension 3/2: Table completeness

• Events are recorded completely if they are bound to the IT system
  Example: sales in a supermarket

• In general, this binding is not possible
  => only parts of the database tables are complete

• But: completeness is only necessary for specific uses
  Example: school statistics from ASTAT

=> Research

Partial table completeness

• Common scenario: we have
  – Some, but not all data complete
  – Questions (“queries”) over the data

• Problems:
  – Do we have the data that is needed to answer the queries?
  If not:
  – What more data do we need?

An (intuitive) example

• Suppose we have data about all students from
  – Italian schools
  – German schools, except for the primary school “Andreas Hofer”
  – Ladin schools, except for the high school “Gherdëna”

• Can we correctly answer questions about the students of Italian schools in South Tyrol?

Yes, because we have all data about students from Italian schools

An (intuitive) example (2)

• Suppose we have data about all students from
  – Italian schools
  – German schools, except for the primary school “Andreas Hofer”
  – Ladin schools, except for the high school “Gherdëna”

• Can we answer questions about the high-school students in South Tyrol?

No, because data from the “Gherdëna” high school is missing

We could bug them to submit their data (but maybe the secretary is on holiday)

We could ask someone else for the data, e.g., the local district administration
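The two intuitive examples follow one pattern: a question can be answered completely if none of the schools with missing data is relevant to it. A minimal sketch of that pattern, assuming each school is described only by its language and level (the `School` class and the function names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    name: str
    language: str  # "italian", "german" or "ladin"
    level: str     # "primary" or "high"


# Schools whose student data we do NOT have (the two exceptions above).
MISSING = [
    School("Andreas Hofer", "german", "primary"),
    School("Gherdëna", "ladin", "high"),
]


def blocking_schools(concerns, missing=MISSING):
    """Names of schools relevant to the question whose data is missing."""
    return [school.name for school in missing if concerns(school)]


def answerable(concerns, missing=MISSING):
    """A question is answerable if no relevant school is missing."""
    return not blocking_schools(concerns, missing)


print(answerable(lambda s: s.language == "italian"))  # True
print(answerable(lambda s: s.level == "high"))        # False
print(blocking_schools(lambda s: s.level == "high"))  # ['Gherdëna']
```

The list of blocking schools also answers the follow-up question of which additional data would have to be obtained.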

Our research

• How can one describe that data is complete to a certain extent?

• How can one find out whether the data one has is sufficient for a certain use?

• How can one find out which data is necessary to serve a certain use?


Formal example

How many students attend an Italian school?

SELECT count(*)
FROM student, school
WHERE student.school = school.name
  AND school.language = 'italian';

Suppose we have all students from Italian schools.

Can we answer this query completely?

How can we formalize table completeness?

“We have all students from Italian schools”

• We imagine an ideal database that contains complete information about the world

• Completeness statements refer to this ideal database:

=> All ideal students from Italian schools occur among our real students

How can we assert (partial) completeness of tables?

“We have all students from Italian schools”

Table completeness assertion:

real.student CONTAINS (
  SELECT ideal.student.*
  FROM ideal.student, ideal.school
  WHERE ideal.student.school = ideal.school.name
    AND ideal.school.language = 'italian')

Table completeness assertions constitute a logical theory about the real and the ideal database
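In practice the ideal database is never available, so an assertion like this is a statement one trusts, not one that is executed. On a toy example where both databases are known, its meaning can still be spelled out: the assertion holds if the ideal students of Italian schools, minus the real students, is the empty set. The sketch below uses SQLite's `EXCEPT` for that. Table names with underscores replace the `real.`/`ideal.` prefixes, and the school names “Dante” and “Goethe” are invented for illustration.

```python
import sqlite3


def toy_database(real_students):
    """Build an in-memory database with a known ideal state and given real students."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE ideal_school (name TEXT, language TEXT);
        CREATE TABLE ideal_student (name TEXT, school TEXT);
        CREATE TABLE real_student (name TEXT, school TEXT);
        INSERT INTO ideal_school VALUES ('Dante', 'italian'), ('Goethe', 'german');
        INSERT INTO ideal_student VALUES
            ('Paul', 'Dante'), ('Anna', 'Dante'), ('Maria', 'Goethe');
        """
    )
    con.executemany("INSERT INTO real_student VALUES (?, ?)", real_students)
    return con


def assertion_holds(con):
    """real_student CONTAINS all ideal students of Italian schools."""
    missing = con.execute(
        """
        SELECT ideal_student.name, ideal_student.school
        FROM ideal_student, ideal_school
        WHERE ideal_student.school = ideal_school.name
          AND ideal_school.language = 'italian'
        EXCEPT
        SELECT name, school FROM real_student
        """
    ).fetchall()
    return not missing


# Maria (German school) is missing, but the assertion only concerns Italian schools.
print(assertion_holds(toy_database([("Paul", "Dante"), ("Anna", "Dante")])))  # True
# Anna (Italian school) is missing, so the assertion is violated.
print(assertion_holds(toy_database([("Paul", "Dante"), ("Maria", "Goethe")])))  # False
```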

What does it mean that our query is complete?

Consider two versions:

“Real query”

SELECT count(*)
FROM real.student, real.school
WHERE real.student.school = real.school.name
  AND real.school.language = 'italian';

“Ideal query”

SELECT count(*)
FROM ideal.student, ideal.school
WHERE ideal.student.school = ideal.school.name
  AND ideal.school.language = 'italian';

Our query is complete if the real and the ideal query return the same result

(Can be expressed in logic, too)

=> Reasoning
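On a toy example where the ideal database is known, query completeness can be illustrated by running the same query against both sets of tables and comparing the results. This is only a sketch of the definition, not the reasoning procedure of the talk, which has to decide completeness without access to the ideal database. Table names with underscores replace the `real.`/`ideal.` prefixes, the school names are invented, and the real school table is assumed to be complete.

```python
import sqlite3

QUERY = """
SELECT count(*)
FROM {db}_student AS student, {db}_school AS school
WHERE student.school = school.name
  AND school.language = 'italian'
"""


def toy_database(real_students):
    """In-memory database with a known ideal state and the given real students."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE ideal_school (name TEXT, language TEXT);
        CREATE TABLE ideal_student (name TEXT, school TEXT);
        CREATE TABLE real_school (name TEXT, language TEXT);
        CREATE TABLE real_student (name TEXT, school TEXT);
        INSERT INTO ideal_school VALUES ('Dante', 'italian'), ('Goethe', 'german');
        INSERT INTO real_school SELECT * FROM ideal_school;
        INSERT INTO ideal_student VALUES
            ('Paul', 'Dante'), ('Anna', 'Dante'), ('Maria', 'Goethe');
        """
    )
    con.executemany("INSERT INTO real_student VALUES (?, ?)", real_students)
    return con


def query_is_complete(con):
    """The query is complete if real and ideal query return the same result."""
    real = con.execute(QUERY.format(db="real")).fetchone()
    ideal = con.execute(QUERY.format(db="ideal")).fetchone()
    return real == ideal


# Maria is missing, but she does not attend an Italian school: 2 == 2.
print(query_is_complete(toy_database([("Paul", "Dante"), ("Anna", "Dante")])))  # True
# Anna is missing and attends an Italian school: 1 != 2.
print(query_is_complete(toy_database([("Paul", "Dante"), ("Maria", "Goethe")])))  # False
```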

Our results so far

• Formalization

• General reasoning procedures for
  – Single-block SQL queries
  – With comparisons
  – Group By
  – Aggregate functions min, max, count, sum

• Complexity analysis (sometimes high!)

• Architecture for reasoning system

• “Inverse reasoning” (see later slide)

This is a start; many things are still missing

Reasoning with schema information

• To draw interesting inferences, we need to take into account
  – Keys
  – Foreign keys
  – Finite domains

~> Reasoning becomes more complicated

(Current research)

Inverse reasoning

• So far:
  Given: assertions about table completeness
  Question: Can query Q be answered completely?

• Also interesting:
  Given: query Q
  Question: Which are the minimal completeness assertions that ensure completeness of Q?

• Can be answered by applying our inference methods backwards

Perspective: Probabilistic completeness management

• Our theory so far:
  Boolean statements: complete / not complete

• In practice, it is often sufficient to know
  “With probability ≥ p, we make an error < ε”

• Probabilistic assertions:
  “With 90% probability, we are not missing more than 5 students”

=> Probabilistic inferences
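A probabilistic assertion of this kind translates directly into bounds on a count query. The sketch below assumes that the recorded students are all correct, so that the true count can only be larger than the observed one. The observed count of 120 is a made-up number, and the function name is hypothetical.

```python
def count_bounds(observed_count, max_missing, probability):
    """Bounds on the true count, given 'with `probability` at most `max_missing` records are missing'."""
    return observed_count, observed_count + max_missing, probability


low, high, p = count_bounds(observed_count=120, max_missing=5, probability=0.9)
print(f"With probability {p:.0%}, the true count lies between {low} and {high}")
```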

Conclusion

• Data quality has several dimensions
  – Correctness, timeliness, completeness

• Our current interest
  – How can one describe which data are complete?
  – How can one find out which queries can be answered completely?
  – If not, which additional data is needed?

• Perspective: Probabilistic completeness management