Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress) Christian Fürber and Martin Hepp [email protected], [email protected] Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden

Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Page 1: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress)

Christian Fürber and Martin Hepp

[email protected], [email protected]

Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden

Page 2: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 1:

What's the Problem?

C. Fürber, M. Hepp:

Towards a Vocabulary for DQM

In SemWeb Architectures

2

Page 3: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Various Data Quality Problems

Character alignment violation

Invalid characters

Word transpositions

Invalid substrings

Mistyping / Misspelling errors

False values

Misfielded values

Meaningless values

Missing values

Out of range values

Functional dependency violation

Incorrect reference

Referential integrity violation

Contradictory relationships

Imprecise values

Existence of Synonyms

Existence of Homonyms

Unique value violation

Inconsistent duplicates

Approximate duplicates

Outdated values

Outdated conceptual elements

Cardinality violation

Missing classification

Untyped literals

Incorrect classification

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 4: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

The Problem

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql

Negative Population

Weird Population Values

Invalid URLs

Page 5: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 2:

What are high quality data?

Page 6: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

What is Data Quality?

• Data's "fitness for use by data consumers" (Wang, Strong 1996)

• "Conformance to specification" (Kahn et al. 2002)

• "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001)

• Requirements as "Benchmark"

Page 7: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Perspective-Neutral Data Quality

Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.

Page 8: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

The Problem

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql

Negative Population

Weird Population Values

Invalid URLs

Quality Requirements:

• Population cannot be negative

• Population is indicated by numeric values

• URLs usually start with http://, https://, etc.
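The three requirements on this slide can be written down as executable checks. The following is only a minimal sketch in Python, not the DQM-Vocabulary itself: the record layout (`population`, `url` keys), the function names, and the problem labels are assumptions made for illustration, and the allowed URL prefixes are limited to the two the slide names explicitly.

```python
import math

# Assumption: only the two schemes named on the slide are accepted here;
# the slide's "etc." suggests the real list may be longer.
ALLOWED_URL_PREFIXES = ("http://", "https://")


def population_problems(value):
    """Return the population requirements that `value` violates."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return ["population is not numeric"]
    if not math.isfinite(number):
        return ["population is not numeric"]
    if number < 0:
        return ["population is negative"]
    return []


def url_problems(value):
    """Return the URL requirements that `value` violates."""
    if isinstance(value, str) and value.lower().startswith(ALLOWED_URL_PREFIXES):
        return []
    return ["URL does not start with an allowed scheme"]


def check_record(record):
    """Check one record (a dict with hypothetical keys) against all requirements."""
    return population_problems(record.get("population")) + url_problems(record.get("url"))


if __name__ == "__main__":
    for rec in [
        {"population": "-5", "url": "http://example.org"},
        {"population": "many", "url": "example.org"},
        {"population": 1200, "url": "https://example.org"},
    ]:
        print(rec, "->", check_record(rec))
```

Hard-coding requirements like this is what the proposed vocabulary aims to avoid: the rules would instead be stored as data and shared across consumers.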

Page 9: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Satisfying Quality Requirements

Figure labels: Status Quo; Desired State (individuals, groups, standards, etc.)

Problem 1: Expressing Quality Requirements

Problem 2: Harmonizing Requirements

Problem 3: Satisfying Requirements

Page 10: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 3:

Research Goal

Page 11: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Major Research Goal

• Represent quality-relevant information for automated…

– Data Quality Monitoring

– Data Quality Assessment

– Data Cleansing

– Filtering of High Quality Data

…in a standardized vocabulary.

Page 12: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Motives for DQM-Vocabulary

• Help people explicitly express data quality requirements in the "same language" on Web scale

• Support the creation of consensual agreements on quality requirements

• Reduce effort for DQM activities

• Raise transparency about assumed quality requirements

• Enable consistency checks among quality requirements

Page 13: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 4:

Our Approach

Page 14: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Basic Architecture

Figure (architecture diagram) labels: RDB A, RDB B; Data Acquisition; Knowledgebase; DQM-Vocabulary; SPARQL-Query-Engine; Assessment Scores; Problem Classification; Cleansed Data; HQ Data Retrieval

Page 15: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Main Concepts of DQM-Vocabulary

Express Requirements

Express Cleansing Tasks

Annotate Quality Scores

Classify Quality Problems

Account for Task-Dependent Requirements

Page 16: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Data Quality Problem Types:

Source for Potential Requirements

Character alignment violation

Invalid characters

Word transpositions

Invalid substrings

Mistyping / Misspelling errors

False values

Misfielded values

Meaningless values

Missing values

Out of range values

Functional dependency violation

Incorrect reference

Referential integrity violation

Contradictory relationships

Imprecise values

Existence of Synonyms

Existence of Homonyms

Unique value violation

Inconsistent duplicates

Approximate duplicates

Outdated values

Outdated conceptual elements

Cardinality violation

Missing classification

Incorrect classification

Page 17: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Data Quality Requirements

Syntactical Rules

Semantic Rules

Redundancy Rules

Completeness Rules

Timeliness Rules

Page 18: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Quality-Influencing Artifacts

Current focus of DQM-Vocabulary: Data

Page 19: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Design Alternatives:

Statements about Classes & Properties

(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs

Page 20: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 5:

Application Examples

Page 21: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Example 1: Legal Value Rule (1/3)

Which instances have illegal values for property foo:country?

Page 22: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Example 1: Legal Value Rule (2/3)

Figure: foo:LegalValueRule_1, an instance of the class dqm:LegalValueRule, with the literal values "foo:Countries", "foo:countryName", "tref:Countries", and "tref:countryName". (Legend: class, instance, literal value.)
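To make the idea of a legal value rule concrete, here is a minimal Python sketch that evaluates such a rule over plain triples. It is not the SPARQL-based implementation from the talk. The dictionary keys (`tested_class`, `reference_property`, and so on) are hypothetical names and not DQM-Vocabulary properties, the sample triples are invented, and the class and property names are taken from the diagram labels above.

```python
RDF_TYPE = "rdf:type"

# Hypothetical rule description. Only the four quoted names and the rule
# identifier come from the slide; the key names are illustrative.
RULE = {
    "id": "foo:LegalValueRule_1",
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}


def instances_of(triples, cls):
    """Subjects that are typed as `cls`."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == cls}


def values_of(triples, subjects, prop):
    """(subject, value) pairs of `prop` for the given subjects."""
    return [(s, o) for s, p, o in triples if p == prop and s in subjects]


def illegal_values(triples, rule):
    """Tested (instance, value) pairs whose value is absent from the reference data."""
    reference = instances_of(triples, rule["reference_class"])
    legal = {o for _, o in values_of(triples, reference, rule["reference_property"])}
    tested = instances_of(triples, rule["tested_class"])
    pairs = values_of(triples, tested, rule["tested_property"])
    return sorted((s, o) for s, o in pairs if o not in legal)


# Invented sample data: trusted reference values (tref:) and data under test (foo:).
SAMPLE = [
    ("tref:c1", RDF_TYPE, "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", RDF_TYPE, "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:a", RDF_TYPE, "foo:Countries"),
    ("foo:a", "foo:countryName", "Germany"),
    ("foo:b", RDF_TYPE, "foo:Countries"),
    ("foo:b", "foo:countryName", "Sweeden"),
]

if __name__ == "__main__":
    print(illegal_values(SAMPLE, RULE))
```

Because the rule is itself data, the same generic function can serve any number of rules; only the rule description changes.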

Page 23: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Example 1: Legal Value Rule (3/3)

Page 24: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Example 2: DQ-Assessment (1/2)

How syntactically accurate are all properties that are subject to LegalValueRules?
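One way to turn this question into a number is sketched below in Python. The slides do not define the score, so the formula used here (the share of tested values that appear among the legal values) is an assumption, as are the function names and the sample data.

```python
def syntactic_accuracy(tested_values, legal_values):
    """Share of tested values that are legal, or None if nothing was tested.

    Assumption: accuracy is measured as (legal tested values) / (all tested values).
    """
    tested = list(tested_values)
    if not tested:
        return None
    legal = set(legal_values)
    hits = sum(1 for value in tested if value in legal)
    return hits / len(tested)


def assess(properties):
    """Score every property; `properties` maps a name to (tested_values, legal_values)."""
    return {
        name: syntactic_accuracy(tested, legal)
        for name, (tested, legal) in properties.items()
    }


if __name__ == "__main__":
    # Invented sample: one property governed by a legal value rule.
    scores = assess({
        "foo:countryName": (
            ["Germany", "Sweeden", "Sweden", "Germany"],
            {"Germany", "Sweden"},
        ),
    })
    print(scores)
```

Returning `None` for a property without values keeps "nothing to assess" distinct from a score of zero.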

Page 25: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Example 2: DQ-Assessment (2/2)

Page 26: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Part 6:

Conclusions & Planned Work

Page 27: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Advantages of DQM-Vocabulary

• Minimizes human effort for DQM

• Web-scale sharing/reuse of data quality requirements

• Consistency checks among data quality requirements

• Transparency about applied data quality rules

Page 28: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Limitations

• Representation of complex functional dependency rules and derivation rules

• Limited experience with real-world data sets

• Currently no own concepts for classes and properties

• Research still in progress

Page 29: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Future Work

• Evaluation of design alternatives

• Development of processing framework

• Representation of more complex functional dependency rules / derivation rules

• Extension of DQM-Vocabulary

• Evaluation on real-world data sets

• Publication at http://semwebquality.org

Page 30: Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

Christian Fürber

Researcher

E-Business & Web Science Research Group

Werner-Heisenberg-Weg 39

85577 Neubiberg

Germany

skype c.fuerber

email [email protected]

web http://www.unibw.de/ebusiness

homepage http://www.fuerber.com

twitter http://www.twitter.com/cfuerber

Paper available at http://bit.ly/gYEDdQ