Towards a Vocabulary for Data Quality Management in Semantic Web Architectures


Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress)

Christian Fürber and Martin Hepp

christian@fuerber.com, mhepp@computer.org

Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden

Part 1:

What's the Problem?

C. Fürber, M. Hepp:

Towards a Vocabulary for DQM

In SemWeb Architectures


Various Data Quality Problems


Character alignment violation

Invalid characters

Word transpositions

Invalid substrings

Mistyping / Misspelling errors

False values

Misfielded values

Meaningless values

Missing values

Out of range values

Functional dependency violation

Incorrect reference

Referential integrity violation

Contradictory relationships

Imprecise values

Existence of Synonyms

Existence of Homonyms

Unique value violation

Inconsistent duplicates

Approximate duplicates

Outdated values

Outdated conceptual elements

Cardinality violation

Missing classification

Untyped literals

Incorrect classification


Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

The Problem


Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql

Negative Population

Weird Population Values

Invalid URLs

Part 2:

What are high quality data?


What is Data Quality?

• Data's "fitness for use by data consumers" (Wang, Strong 1996)

• "Conformance to specification" (Kahn et al. 2002)

• "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001)


• Requirements as "Benchmark"

Perspective-Neutral Data Quality


Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.

The Problem


Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql

Quality Requirements:

• Negative population: population cannot be negative

• Weird population values: population is indicated by numeric values

• Invalid URLs: URLs usually start with http://, https://, etc.
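The quality requirements on this slide can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function names, the sample values, and the inclusion of ftp among the accepted URL schemes (the slide only says "etc.") are illustrative and are not taken from the slides or from the DQM-Vocabulary.

```python
import math
from urllib.parse import urlparse

# Assumption: the slide names only http:// and https:// ("etc.");
# ftp is added here purely for illustration.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_numeric(value):
    """Requirement: population is indicated by numeric values."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def check_population(value):
    """Return the list of requirements violated by one population value."""
    if not is_numeric(value):
        return ["not numeric"]
    if float(value) < 0:
        # Requirement: population cannot be negative.
        return ["negative"]
    return []


def has_valid_url_scheme(value):
    """Requirement: URLs usually start with http://, https://, etc."""
    parsed = urlparse(value)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


if __name__ == "__main__":
    # Invented sample values, not the data retrieved from the endpoint.
    for population in ["1200", "-50", "unknown"]:
        print(population, check_population(population))
    for url in ["http://example.org/x", "www.example.org"]:
        print(url, has_valid_url_scheme(url))
```

Each check corresponds to one requirement, so a violation can be traced back to the requirement it breaks.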

Satisfying Quality Requirements


[Figure: the status quo compared with desired states defined by individuals, groups, standards, etc.]

Problem 1: Expressing Quality Requirements

Problem 2: Harmonizing Requirements

Problem 3: Satisfying Requirements

Part 3:

Research Goal


Major Research Goal

• Represent quality-relevant information for automated…

– Data Quality Monitoring

– Data Quality Assessment

– Data Cleansing

– Filtering of High Quality Data

…in a standardized vocabulary.


Motives for DQM-Vocabulary

• Support people in explicitly expressing data quality requirements in the "same language" at Web scale

• Support the creation of consensual agreements upon quality requirements

• Reduce effort for DQM activities

• Raise transparency about assumed quality requirements

• Enable consistency checks among quality requirements

Part 4:

Our Approach


Basic Architecture


[Figure: architecture diagram. Labels: RDB A, RDB B, Data Acquisition, Knowledgebase, DQM-Vocabulary, SPARQL-Query-Engine, Assessment Scores, Problem Classification, Cleansed Data, HQ Data Retrieval.]

Main Concepts of DQM-Vocabulary


• Express Requirements

• Express Cleansing Tasks

• Annotate Quality Scores

• Classify Quality Problems

• Account for Task-Dependent Requirements

Data Quality Problem Types: Source for Potential Requirements

Character alignment violation

Invalid characters

Word transpositions

Invalid substrings

Mistyping / Misspelling errors

False values

Misfielded values

Meaningless values

Missing values

Out of range values

Functional dependency violation

Incorrect reference

Referential integrity violation

Contradictory relationships

Imprecise values

Existence of Synonyms

Existence of Homonyms

Unique value violation

Inconsistent duplicates

Approximate duplicates

Outdated values

Outdated conceptual elements

Cardinality violation

Missing classification

Incorrect classification


Data Quality Requirements


Syntactical Rules

Semantic Rules

Redundancy Rules

Completeness Rules

Timeliness Rules

Quality-Influencing Artifacts


[Figure: quality-influencing artifacts. Label "Data" is marked as the current focus of the DQM-Vocabulary.]

Design Alternatives: Statements about Classes & Properties

(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs


Part 5:

Application Examples


Example 1: Legal Value Rule (1/3)


Which instances have illegal values for the property foo:country?

Example 1: Legal Value Rule (2/3)


[Figure: foo:LegalValueRule_1, an instance of the class dqm:LegalValueRule, connected to the literal values "foo:Countries", "foo:countryName", "tref:Countries", and "tref:countryName". Legend: Class, Instance, Literal value.]
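To make the legal value rule concrete, here is a minimal Python sketch of how such a rule could be evaluated. It assumes that the foo: names identify the data under test and the tref: names identify the trusted reference list, which is a reading of the slide rather than something it states. The dictionary keys, the helper functions, and all sample triples are hypothetical and not part of the DQM-Vocabulary.

```python
# Triples as (subject, predicate, object) tuples; all sample data is invented.
TRIPLES = [
    ("foo:loc1", "rdf:type", "foo:Countries"),
    ("foo:loc1", "foo:countryName", "Germany"),
    ("foo:loc2", "rdf:type", "foo:Countries"),
    ("foo:loc2", "foo:countryName", "Germny"),
    ("tref:c1", "rdf:type", "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", "rdf:type", "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
]

# Hypothetical key names; the slide shows only the four literal values.
LEGAL_VALUE_RULE_1 = {
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}


def instances_of(triples, cls):
    """Return the subjects typed as the given class."""
    return {s for s, p, o in triples if p == "rdf:type" and o == cls}


def values_of(triples, cls, prop):
    """Return (instance, value) pairs of a property for instances of a class."""
    members = instances_of(triples, cls)
    return [(s, o) for s, p, o in triples if p == prop and s in members]


def illegal_values(triples, rule):
    """Return tested (instance, value) pairs whose value is not a reference value."""
    legal = {
        value
        for _, value in values_of(
            triples, rule["reference_class"], rule["reference_property"]
        )
    }
    tested = values_of(triples, rule["tested_class"], rule["tested_property"])
    return [(s, value) for s, value in tested if value not in legal]


if __name__ == "__main__":
    print(illegal_values(TRIPLES, LEGAL_VALUE_RULE_1))
```

Because the rule is plain data, the same checking function can serve any number of rules, which mirrors the idea of storing requirements alongside the data and evaluating them with generic queries.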

Example 1: Legal Value Rule (3/3)


Example 2: DQ-Assessment (1/2)


How syntactically accurate are all properties that are subject to LegalValueRules?
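One way to answer this question is to score each property by the share of its values that appear in the corresponding legal value list. The Python sketch below assumes that scoring formula, which the slides do not define. The property foo:currency and all sample values are invented for illustration.

```python
def syntactic_accuracy(values, legal_values):
    """Share of values found in the legal value set; None if nothing to assess."""
    if not values:
        return None
    legal = set(legal_values)
    return sum(1 for value in values if value in legal) / len(values)


# Invented sample data: property -> (observed values, legal values).
RULED_PROPERTIES = {
    "foo:countryName": (
        ["Germany", "Germny", "Sweden", "Sweden"],
        ["Germany", "Sweden"],
    ),
    "foo:currency": (["EUR", "SEK"], ["EUR", "SEK", "USD"]),
}


def assess(ruled_properties):
    """Compute one accuracy score per property that is subject to a rule."""
    return {
        prop: syntactic_accuracy(values, legal)
        for prop, (values, legal) in ruled_properties.items()
    }


if __name__ == "__main__":
    for prop, score in assess(RULED_PROPERTIES).items():
        print(prop, score)
```

Returning None for a property without values keeps "nothing to assess" distinct from a score of zero.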

Example 2: DQ-Assessment (2/2)


Part 6:

Conclusions & Planned Work


Advantages of DQM-Vocabulary

• Minimizes human effort for DQM

• Web-scale sharing/reuse of data quality requirements

• Consistency checks among data quality requirements

• Transparency about applied data quality rules


Limitations

• Representation of complex functional dependency rules and derivation rules

• Limited experience with real-world data sets

• Currently no dedicated concepts for classes and properties

• Research still in progress


Future Work

• Evaluation of design alternatives

• Development of processing framework

• Representation of more complex functional dependency rules / derivation rules

• Extension of DQM-Vocabulary

• Evaluation on real-world data sets

• Publication at http://semwebquality.org


Christian Fürber, Researcher

E-Business & Web Science Research Group

Werner-Heisenberg-Weg 39

85577 Neubiberg

Germany

skype c.fuerber

email christian@fuerber.com

web http://www.unibw.de/ebusiness

homepage http://www.fuerber.com

twitter http://www.twitter.com/cfuerber

Paper available at http://bit.ly/gYEDdQ
