Towards a Vocabulary for
DQM in Semantic Web
Architectures (Research in Progress)
Christian Fürber and Martin Hepp
[email protected], [email protected]
Presentation @ 1st International Workshop on Linked Web
Data Management,
March 25th, 2011, Uppsala, Sweden
Part 1:
What's the Problem?
C. Fürber, M. Hepp:
Towards a Vocabulary for DQM
In SemWeb Architectures
Various Data Quality Problems
Character alignment violation
Invalid characters
Word transpositions
Invalid substrings
Mistyping / Misspelling errors
False values
Misfielded values
Meaningless values
Missing values
Out of range values
Functional dependency violation
Incorrect reference
Referential integrity violation
Contradictory relationships
Imprecise values
Existence of Synonyms
Existence of Homonyms
Unique value violation
Inconsistent duplicates
Approximate duplicates
Outdated values
Outdated conceptual elements
Cardinality violation
Missing classification
Untyped literals
Incorrect classification
Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
The Problem
Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
Negative Population
Weird Population Values
Invalid URLs
Part 2:
What are high quality data?
What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended
uses in operations, decision making, and planning. Data
are fit for use if they are free of defects and possess
desired features." (Redman 2001)
• Requirements as "benchmark"
Perspective-Neutral Data Quality
Data quality is the degree to which
data fulfills quality requirements
…no matter who makes the quality requirements.
The Problem
Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
Negative Population
Weird Population Values
Invalid URLs
Quality Requirements:
• Population cannot be negative
• Population is indicated by numeric values
• URLs usually start with http://, https://, etc.
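The three quality requirements on this slide can be sketched as simple checks. This is only an illustration and not the approach of the presentation, which relies on a vocabulary and SPARQL. The record fields (name, population, url) and the sample values are assumptions, and the URL check accepts only http:// and https:// although the slide's "etc." suggests that further schemes may be acceptable.

```python
# Hypothetical sample records; the field names and values are assumptions
# made for this illustration and do not come from the retrieved data set.
records = [
    {"name": "A", "population": 12000, "url": "http://example.org/a"},
    {"name": "B", "population": -5, "url": "https://example.org/b"},
    {"name": "C", "population": "unknown", "url": "example.org/c"},
]


def is_numeric(value):
    """Return True for int or float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_record(record):
    """Return the list of quality requirements that one record violates."""
    violations = []
    population = record.get("population")
    if not is_numeric(population):
        violations.append("population is not numeric")
    elif population < 0:
        violations.append("population is negative")
    url = str(record.get("url", ""))
    # Assumption: only the two schemes named on the slide are accepted.
    if not url.startswith(("http://", "https://")):
        violations.append("URL does not start with http:// or https://")
    return violations


# Map each record name to its violated requirements.
report = {record["name"]: check_record(record) for record in records}
```

Each hard-coded check corresponds to one requirement that a vocabulary-based approach would instead state as data, so that the same requirement can be shared and reused.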
Satisfying Quality Requirements
[Figure: status quo versus the desired states defined by individuals, groups, standards, etc.]
Problem 1: Expressing Quality Requirements
Problem 2: Harmonizing Requirements
Problem 3: Satisfying Requirements
Part 3:
Research Goal
Major Research Goal
• Represent Quality-Relevant information for
automated…
– Data Quality Monitoring
– Data Quality Assessment
– Data Cleansing
– Filtering of High Quality Data
…in a standardized vocabulary.
Motives for DQM-Vocabulary
• Help people explicitly express data quality
requirements in the "same language" at Web scale
• Support the creation of consensual agreements
on quality requirements
• Reduce the effort for DQM activities
• Raise transparency about assumed quality
requirements
• Enable consistency checks among quality
requirements
Part 4:
Our Approach
Basic Architecture
[Figure: architecture diagram. Labels: RDB A, RDB B, Data Acquisition, Knowledgebase, DQM-Vocabulary, SPARQL-Query-Engine, Assessment Scores, Problem Classification, Cleansed Data, HQ Data Retrieval]
Main Concepts of DQM-Vocabulary
• Express Requirements
• Express Cleansing Tasks
• Annotate Quality Scores
• Classify Quality Problems
• Account for Task-Dependent Requirements
Data Quality Problem Types:
Source for Potential Requirements
Character alignment violation
Invalid characters
Word transpositions
Invalid substrings
Mistyping / Misspelling errors
False values
Misfielded values
Meaningless values
Missing values
Out of range values
Functional dependency violation
Incorrect reference
Referential integrity violation
Contradictory relationships
Imprecise values
Existence of Synonyms
Existence of Homonyms
Unique value violation
Inconsistent duplicates
Approximate duplicates
Outdated values
Outdated conceptual elements
Cardinality violation
Missing classification
Incorrect classification
Data Quality Requirements
Syntactical Rules
Semantic Rules
Redundancy Rules
Completeness Rules
Timeliness Rules
Quality-Influencing Artifacts
[Figure: quality-influencing artifacts. Data is the current focus of the DQM-Vocabulary.]
Design Alternatives:
Statements about Classes & Properties
(1) Using classes and properties as subjects
(2) Using datatype properties with xsd:anyURI
(3) Mapping class and property URIs to new URIs
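One possible reading of the three alternatives, sketched as plain (subject, predicate, object) tuples. The slides do not show the actual triples, so every "ex:" name, the target property URI, and the interpretation of alternative (3) are assumptions made for illustration only.

```python
# Triples as plain (subject, predicate, object) tuples. All "ex:" names are
# hypothetical and are not taken from the DQM-Vocabulary.
XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"
TARGET_PROPERTY = "http://example.org/foo#countryName"  # hypothetical URI


def literal(value, datatype):
    """Represent a typed literal as a (value, datatype) pair."""
    return (value, datatype)


# (1) The property itself is the subject of the quality statement.
alt1 = [(TARGET_PROPERTY, "ex:hasRule", "ex:Rule_1")]

# (2) The rule is the subject; the property is referenced through a
#     datatype property whose value is an xsd:anyURI literal.
alt2 = [("ex:Rule_1", "ex:testedProperty", literal(TARGET_PROPERTY, XSD_ANYURI))]

# (3) The property URI is mapped to a new URI, and quality statements
#     refer to the new URI instead of the original one.
alt3 = [
    ("ex:countryName_proxy", "ex:mapsTo", TARGET_PROPERTY),
    ("ex:Rule_1", "ex:testedElement", "ex:countryName_proxy"),
]
```

In this sketch the alternatives differ in where the original property URI appears: as a subject in (1), as a typed literal in (2), and only inside a mapping statement in (3).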
Part 5:
Application Examples
Example 1: Legal Value Rule (1/3)
Which instances have illegal values for the property foo:country?
Example 1: Legal Value Rule (2/3)
[Figure: foo:LegalValueRule_1, an instance of the class dqm:LegalValueRule, with the literal values "foo:Countries", "foo:countryName", "tref:Countries", and "tref:countryName". Legend: class, instance, literal value.]
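A minimal sketch of how such a legal value rule could be evaluated, assuming that the "tref:" names denote a trusted reference whose values count as legal. The slides do not name the properties that attach the four literals to the rule, so the dictionary keys below, the row layout, and the sample data are hypothetical.

```python
# The rule mirrors the four literal values shown on the slide. The dictionary
# keys are assumptions; the slide does not name the corresponding properties.
legal_value_rule_1 = {
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}

# Hypothetical data as (instance, class, property, value) rows.
data = [
    ("foo:c1", "foo:Countries", "foo:countryName", "Germany"),
    ("foo:c2", "foo:Countries", "foo:countryName", "Sweden"),
    ("foo:c3", "foo:Countries", "foo:countryName", "Swedn"),
    ("tref:r1", "tref:Countries", "tref:countryName", "Germany"),
    ("tref:r2", "tref:Countries", "tref:countryName", "Sweden"),
]


def values_of(rows, cls, prop):
    """Collect (instance, value) pairs for one class and property."""
    return {(inst, val) for inst, c, p, val in rows if c == cls and p == prop}


def illegal_instances(rows, rule):
    """Return instances whose tested value is missing from the reference."""
    reference = values_of(rows, rule["reference_class"], rule["reference_property"])
    legal = {val for _, val in reference}
    tested = values_of(rows, rule["tested_class"], rule["tested_property"])
    return sorted(inst for inst, val in tested if val not in legal)


illegal = illegal_instances(data, legal_value_rule_1)
```

Because the rule is plain data, the same checking function works for any other pair of tested and reference properties.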
Example 1: Legal Value Rule (3/3)
Example 2: DQ-Assessment (1/2)
How syntactically accurate are all
properties that are subject to
LegalValueRules?
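A rough sketch of one way to answer this question, assuming that syntactic accuracy is measured as the share of a property's values that appear in its legal value set. The slides do not state the formula, so this metric, the function name, and the sample values are assumptions.

```python
def syntactic_accuracy(values, legal_values):
    """Share of values found in the legal value set, or None if no values."""
    values = list(values)
    if not values:
        return None
    legal = set(legal_values)
    return sum(1 for value in values if value in legal) / len(values)


# Hypothetical input: for each property under a legal value rule, the
# observed values and the legal values.
rules = {
    "foo:countryName": (
        ["Germany", "Sweden", "Swedn", "Germany"],
        ["Germany", "Sweden"],
    ),
}

# One accuracy score per property.
scores = {
    prop: syntactic_accuracy(observed, legal)
    for prop, (observed, legal) in rules.items()
}
```

Here three of the four observed values are legal, so the property scores 0.75.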
Example 2: DQ-Assessment (2/2)
Part 6:
Conclusions &
Planned Work
Advantages of DQM-Vocabulary
• Minimizes human effort for DQM
• Web-Scale sharing/reuse of data quality
requirements
• Consistency checks among data quality
requirements
• Transparency about applied data quality
rules
Limitations
• Representation of complex functional
dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and
properties
• Research still in progress
Future Work
• Evaluation of design alternatives
• Development of processing framework
• Representation of more complex
functional dependency rules / derivation
rules
• Extension of DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
Christian Fürber
Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
email [email protected]
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber
Paper available at http://bit.ly/gYEDdQ