Using Semantic Web Resources for Data Quality Management
Christian Fürber and Martin Hepp
Presentation at the 17th International Conference on
Knowledge Engineering and Knowledge Management,
October 10-15, 2010, Lisbon, Portugal
Purpose of Data
C. Fürber, M. Hepp:
Using SemWeb Resources for DQM 2
[Figure: DATA (shown as binary digits) and its uses: Measurement, Automation, Information & Knowledge, Decisions]
Data Quality in Practice
Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html
The Web of Messy Data?
Which one is the correct population?
Retrieved from http://dbpedia.org/sparql on July 20th
The Web of Messy Data?
Retrieved from http://dbpedia.org/sparql on July 20th
Places with negative population?!?
Risk of Failure
[Figure: DATA (shown as binary digits) and its uses: Measurement, Automation, Information & Knowledge, Decisions]
Data Quality Problem Types
• Character alignment violation
• Invalid characters
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• False values
• Misfielded values
• Meaningless values
• Missing values
• Out of range values
• Functional dependency violation
• Incorrect reference
• Referential integrity violation
• Contradictory relationships
• Imprecise values
• Existence of synonyms
• Existence of homonyms
• Unique value violation
• Inconsistent duplicates
• Approximate duplicates
• Outdated values
• Outdated conceptual elements
• Cardinality violation
• Missing classification
• Untyped literals
• Incorrect classification
Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Goals
• Use Semantic Web data to identify data quality problems at the instance level
• Support the Data Quality Management (DQM) process
Total Data Quality Management for and based on the Semantic Web
[Figure: TDQM cycle around DQ: Define, Measure, Analyze, Improve]
Reference: Richard Wang (1998)
• Define what's good and/or what's poor data quality
• Develop and apply SPARQL queries based on the DQ definition
How can the Semantic Web support Data Quality Management?
Availability of FREE Data Quality Knowledge,
e.g. for the identification of…
• Legal value violations
• Functional dependency violations
Using Trusted References
Tested Knowledgebase (local:Location): Las Vegas, France
Trusted Reference (tref:Location): Las Vegas, USA
DQ-Constraints: Las Vegas, France
Basic Architecture
Basic Characteristics of SPIN
• Allows definition of generalized SPARQL query templates
• Constraint checking based on SPARQL
• Definition of inferencing rules via SPARQL
http://spinrdf.org/
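The idea of a generalized query template can be illustrated with a minimal Python sketch. It uses plain string templating from the standard library, not SPIN itself; the names `LEGAL_VALUE_TEMPLATE` and `legal_value_query` are hypothetical, and the query body mirrors the legal value query shown later in these slides.

```python
from string import Template

# Hypothetical template: the classes and properties are parameters,
# the query structure stays fixed (illustration only, not SPIN syntax).
LEGAL_VALUE_TEMPLATE = Template("""\
SELECT ?s
WHERE {
  ?s a $tested_class .
  ?s $tested_property ?value .
  OPTIONAL {
    ?s2 a $trusted_class .
    ?s2 $trusted_property ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  FILTER(!bound(?value1))
}
""")


def legal_value_query(tested_class, tested_property,
                      trusted_class, trusted_property):
    """Fill the template with concrete classes and properties."""
    return LEGAL_VALUE_TEMPLATE.substitute(
        tested_class=tested_class,
        tested_property=tested_property,
        trusted_class=trusted_class,
        trusted_property=trusted_property,
    )


query = legal_value_query("vcard:Address", "vcard:country-name",
                          "tref:Location", "tref:country")
print(query)
```

The same template could be filled with other class and property pairs, which is the kind of reuse a template mechanism is meant to provide.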
Generic Data Quality Constraints Library for Easy DQ-Definition
available @ http://semwebquality.org/ontologies/dq-constraints#
• Mandatory properties & literals
• Legal values*
• Legal value ranges
• Functional dependencies*
• Legal syntaxes
• Uniqueness
* Designed to use trusted references
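The constraint types above that do not need a trusted reference can be sketched in a few lines of Python. This is an illustration under assumptions, not the library's implementation: the function names, the record layout (dictionaries with an `id` key), and the sample data are all hypothetical.

```python
import re
from collections import Counter


def missing_mandatory(records, prop):
    """Mandatory property: flag records where prop is absent or empty."""
    return [r["id"] for r in records if r.get(prop) in (None, "")]


def out_of_range(records, prop, low, high):
    """Legal value range: flag records whose value lies outside [low, high]."""
    return [r["id"] for r in records
            if r.get(prop) is not None and not (low <= r[prop] <= high)]


def illegal_syntax(records, prop, pattern):
    """Legal syntax: flag records whose value does not fully match pattern."""
    regex = re.compile(pattern)
    return [r["id"] for r in records
            if r.get(prop) is not None and not regex.fullmatch(str(r[prop]))]


def non_unique(records, prop):
    """Uniqueness: flag records that share their value with another record."""
    counts = Counter(r[prop] for r in records if r.get(prop) is not None)
    return [r["id"] for r in records
            if r.get(prop) is not None and counts[r[prop]] > 1]


# Hypothetical sample records.
places = [
    {"id": "p1", "name": "A", "population": 1200, "zip": "85577"},
    {"id": "p2", "name": "", "population": -5, "zip": "85577"},
    {"id": "p3", "name": "C", "population": 300, "zip": "8557X"},
]

print(missing_mandatory(places, "name"))
print(out_of_range(places, "population", 0, 10**9))
print(illegal_syntax(places, "zip", r"\d{5}"))
print(non_unique(places, "zip"))
```

Each function returns the identifiers of the violating records, in the same spirit as the constraint queries that return the offending instances.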
Definition of Data Quality Constraints based on SPIN
Constraint checking in Practice
Legal Value Constraints
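# Select every address whose country name is missing from the trusted reference.
SELECT ?s
WHERE {
  ?s a vcard:Address .
  ?s vcard:country-name ?value .
  # Look for a trusted location whose country matches as a string.
  OPTIONAL {
    ?s2 a tref:Location .
    ?s2 tref:country ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  # Keep only addresses for which no match was found.
  FILTER(!bound(?value1))
}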
Returns all instances of class vcard:Address whose vcard:country-name value has no matching tref:country value in any instance of tref:Location.
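The logic of this check can also be sketched in Python, assuming the tested values and the trusted values are already loaded into plain Python structures. The function name `find_illegal_values` and the sample data are hypothetical.

```python
def find_illegal_values(tested, trusted_values):
    """Return subjects whose value has no string match in the trusted set."""
    # Compare as strings, mirroring str(?value1) = str(?value) in the query.
    trusted = {str(v) for v in trusted_values}
    return [subject for subject, value in tested if str(value) not in trusted]


# Hypothetical sample data: (subject, country-name) pairs and trusted countries.
addresses = [("addr1", "Germany"), ("addr2", "Germny"), ("addr3", "Portugal")]
trusted_countries = ["Germany", "Portugal", "France"]

violations = find_illegal_values(addresses, trusted_countries)
print(violations)
```

As in the query, a value counts as legal only if it matches a trusted value exactly, so spelling variants are reported as violations.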
Functional Dependency Constraints
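# Select every location whose city-country pair has no match in gn:Location.
SELECT ?s
WHERE {
  ?s a gr:LocationOfSalesOrServiceProvisioning .
  ?s vcard:ADR ?node .
  ?node vcard:city ?value1 .
  ?node vcard:country ?value2 .
  # Keep only pairs without a matching trusted location.
  FILTER NOT EXISTS {
    ?s2 a gn:Location .
    ?s2 gn:asciiname ?value1 .
    ?s2 gn:country ?value2 .
  }
}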
Returns all instances of gr:LocationOfSalesOrServiceProvisioning whose address (vcard:ADR) has a city-country combination without a matching pair in instances of gn:Location.
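A minimal Python sketch of the same functional dependency check follows, assuming the tested locations and the trusted city-country pairs are available as tuples. The function name `find_fd_violations` and the sample data are hypothetical; the Las Vegas pair echoes the trusted reference example earlier in the slides.

```python
def find_fd_violations(tested, trusted_pairs):
    """Return subjects whose (city, country) pair is not a trusted pair."""
    trusted = set(trusted_pairs)
    return [subject for subject, city, country in tested
            if (city, country) not in trusted]


# Hypothetical sample data: (subject, city, country) triples to be tested.
locations = [
    ("shop1", "Las Vegas", "France"),
    ("shop2", "Las Vegas", "USA"),
]
# Hypothetical trusted reference of valid city-country combinations.
trusted_reference = [("Las Vegas", "USA"), ("Paris", "France")]

fd_violations = find_fd_violations(locations, trusted_reference)
print(fd_violations)
```

Both values of a violating pair may be legal on their own; only their combination is missing from the trusted reference.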
Acquisition of Semantic Web Sources for DQM
(1) Replication of relevant knowledge-bases
(2) On the fly via federated SPARQL queries:
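PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The default prefix ":" of the local data set is not declared on the slide.
SELECT *
WHERE {
  ?s1 :location_CITY ?city .
  # Look up an English-labelled DBpedia city with the same label.
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> {
      ?s2 a dbo:City .
      ?s2 rdfs:label ?city .
      FILTER (lang(?city) = "en")
    }
  }
  # Keep only local cities without a DBpedia match.
  FILTER(!bound(?s2))
}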
Limitations
• High degree of uncertainty about the quality of Semantic Web resources
• Risk of proliferating data quality problems
• Lack of Semantic Web resources for certain domains
• Flexible design of RDF and structural heterogeneity complicate the definition of generic DQ constraints
• Scalability on large data sets
• DQ constraints close the world
Contributions
• Data quality control for Semantic Web data
• Identification of potential inconsistencies between Semantic Web resources
• Reduction of effort for the definition of functional dependency rules and legal value rules
• Reuse of shared data quality rules on a Web scale
Future Work
• Semantic Web information quality assessment framework (SWIQA) with computation of KPIs
• Analysis and identification of useful "trusted references" based on SWIQA
• Application to multi-source master data of information systems
• Evaluation on large data sets
Christian Fürber, Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber
Data Quality Constraints Library for SPIN @
http://semwebquality.org/ontologies/dq-constraints#
Paper available at http://bit.ly/c5v6TM