Using Semantic Web Resources for Data Quality Management Christian Fürber and Martin Hepp [email protected], [email protected] Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal

Page 1: Using Semantic Web Resources for Data Quality Management

Using Semantic Web Resources for Data Quality Management

Christian Fürber and Martin Hepp

[email protected], [email protected]

Presentation at the 17th International Conference on Knowledge Engineering and Knowledge Management, October 10-15, 2010, Lisbon, Portugal

Page 2: Using Semantic Web Resources for Data Quality Management

Purpose of Data

C. Fürber, M. Hepp: Using SemWeb Resources for DQM

[Diagram: DATA, drawn as rows of binary digits, linked to Measurement, Automation, Information & Knowledge, and Decisions]

Page 3: Using Semantic Web Resources for Data Quality Management

Data Quality in Practice


Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html


Page 4: Using Semantic Web Resources for Data Quality Management

The Web of Messy Data?

Which one is the correct population?

Retrieved from http://dbpedia.org/sparql on July 20th

Page 5: Using Semantic Web Resources for Data Quality Management

The Web of Messy Data?

Places with negative population?!?

Retrieved from http://dbpedia.org/sparql on July 20th

Page 6: Using Semantic Web Resources for Data Quality Management

Risk of Failure

[Diagram: DATA, drawn as rows of binary digits, linked to Measurement, Automation, Information & Knowledge, and Decisions]

Page 7: Using Semantic Web Resources for Data Quality Management

Data Quality Problem Types

• Character alignment violation
• Invalid characters
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• False values
• Misfielded values
• Meaningless values
• Missing values
• Out of range values
• Functional Dependency Violation
• Incorrect reference
• Referential integrity violation
• Contradictory relationships
• Imprecise values
• Existence of Synonyms
• Existence of Homonyms
• Unique value violation
• Inconsistent duplicates
• Approximate duplicates
• Outdated values
• Outdated conceptual elements
• Cardinality violation
• Missing classification
• Untyped literals
• Incorrect classification

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 8: Using Semantic Web Resources for Data Quality Management

Goals

• Use Semantic Web data to identify data quality problems at the instance level
• Support the Data Quality Management (DQM) process

Page 9: Using Semantic Web Resources for Data Quality Management

Total Data Quality Management for and based on the Semantic Web

[Diagram: data quality (DQ) cycle with the phases Define, Measure, Analyze, Improve. Reference: Richard Wang (1998)]

Callouts on the diagram:

• Define what's good and/or what's poor data quality
• Develop and apply SPARQL queries based on DQ-Definition

Page 10: Using Semantic Web Resources for Data Quality Management

How can the Semantic Web support Data Quality Management?

Availability of FREE Data Quality Knowledge, e.g. for the identification of…

• Legal value violations
• Functional dependency violations

Page 11: Using Semantic Web Resources for Data Quality Management

Using Trusted References

[Diagram: the Tested Knowledgebase (local:Location) holds the pair Las Vegas / France; the Trusted Reference (tref:Location) holds the pair Las Vegas / USA. The DQ-Constraints compare both and single out Las Vegas / France.]

Page 12: Using Semantic Web Resources for Data Quality Management

Basic Architecture


Page 13: Using Semantic Web Resources for Data Quality Management

Basic Characteristics of SPIN

• Allows definition of generalized SPARQL query templates
• Constraint checking based on SPARQL
• Definition of inferencing rules via SPARQL

http://spinrdf.org/
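The notion of a generalized query template can be illustrated outside SPIN. The Python sketch below is only an analogy: SPIN stores templates and their arguments as RDF, whereas this example fills placeholders in a query string. The function name `legal_value_query`, the placeholder names, and the argument values are assumptions made for illustration.

```python
from string import Template

# Hypothetical generic template: find instances of a class whose property
# value has no counterpart in a trusted reference property.
LEGAL_VALUE_TEMPLATE = Template("""\
SELECT ?s
WHERE {
  ?s a $tested_class .
  ?s $tested_property ?value .
  OPTIONAL {
    ?s2 $trusted_property ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  FILTER(!bound(?value1))
}""")


def legal_value_query(tested_class: str, tested_property: str,
                      trusted_property: str) -> str:
    """Fill the template with prefixed names supplied by the caller."""
    return LEGAL_VALUE_TEMPLATE.substitute(
        tested_class=tested_class,
        tested_property=tested_property,
        trusted_property=trusted_property,
    )


# Example arguments; the prefixes would still have to be declared.
query = legal_value_query("vcard:Address", "vcard:country-name", "tref:country")
print(query)
```

One template can thus serve many class and property combinations, which is the reuse benefit the first bullet points to.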

Page 14: Using Semantic Web Resources for Data Quality Management

Generic Data Quality Constraints Library for Easy DQ-Definition

Available at http://semwebquality.org/ontologies/dq-constraints#

• Mandatory properties & literals
• Legal values*
• Legal value ranges
• Functional dependencies*
• Legal syntaxes
• Uniqueness

* Designed to use trusted references
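To make four of these constraint types concrete, here is a minimal Python sketch over plain dictionaries. It is not the library's implementation, which is expressed with SPIN and SPARQL. The records, property names, and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical records standing in for instances of a tested knowledge base.
records = [
    {"id": "a1", "city": "Lisbon", "zip": "1000-001", "population": 1000},
    {"id": "a2", "city": "", "zip": "ABC", "population": -5},
    {"id": "a2", "city": "Munich", "zip": "80331", "population": 2000},
]


def missing_values(rows, prop):
    """Mandatory property/literal: ids of rows where prop is absent or empty."""
    return [r["id"] for r in rows if not r.get(prop)]


def out_of_range(rows, prop, low, high):
    """Legal value range: ids of rows whose value lies outside [low, high]."""
    return [r["id"] for r in rows if prop in r and not low <= r[prop] <= high]


def illegal_syntax(rows, prop, pattern):
    """Legal syntax: ids of rows whose value does not fully match the regex."""
    return [r["id"] for r in rows
            if not re.fullmatch(pattern, str(r.get(prop, "")))]


def duplicates(rows, prop):
    """Uniqueness: values of prop that occur more than once."""
    counts = Counter(r[prop] for r in rows if prop in r)
    return [value for value, n in counts.items() if n > 1]
```

Legal values and functional dependencies are left out here because, as the footnote says, they rely on a trusted reference rather than on a rule that can be stated locally.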

Page 15: Using Semantic Web Resources for Data Quality Management

Definition of Data Quality Constraints based on SPIN

Page 16: Using Semantic Web Resources for Data Quality Management

Constraint checking in Practice


Page 17: Using Semantic Web Resources for Data Quality Management

Legal Value Constraints

# Prefix declarations for vcard: and tref: are not shown on the slide.
SELECT ?s
WHERE {
  ?s a vcard:Address .
  ?s vcard:country-name ?value .
  # Look for a trusted location with the same country string
  OPTIONAL {
    ?s2 a tref:Location .
    ?s2 tref:country ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  # Keep only addresses for which no match was found
  FILTER(!bound(?value1))
}

Returns all instances of class vcard:Address whose vcard:country-name value has no matching value for property tref:country.
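The OPTIONAL / FILTER(!bound(...)) pattern amounts to an anti-join: a subject is kept only when no trusted value matches. The following Python sketch shows the same logic on in-memory data. The sample addresses, the set of country names, and the function name are hypothetical.

```python
# Hypothetical tested data: address id -> country name
addresses = {
    "addr1": "Portugal",
    "addr2": "Germany",
    "addr3": "Germnay",  # misspelled on purpose
}

# Hypothetical trusted reference: set of legal country names
trusted_countries = {"Portugal", "Germany", "France"}


def illegal_values(tested, legal):
    """Return ids whose value has no exact string match in the legal set.

    Mirrors the OPTIONAL / FILTER(!bound(...)) pattern: keep a subject only
    when no matching trusted value was found.
    """
    return sorted(s for s, value in tested.items() if str(value) not in legal)


print(illegal_values(addresses, trusted_countries))
```

Like the query, the comparison is an exact string match, so spelling variants and synonyms of a legal value are reported as violations too.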

Page 18: Using Semantic Web Resources for Data Quality Management

Functional Dependency Constraints

# Prefix declarations for gr:, vcard: and gn: are not shown on the slide.
SELECT ?s
WHERE {
  ?s a gr:LocationOfSalesOrServiceProvisioning .
  ?s vcard:ADR ?node .
  ?node vcard:city ?value1 .
  ?node vcard:country ?value2 .
  # SPARQL 1.1 requires NOT EXISTS to appear inside a FILTER
  FILTER NOT EXISTS {
    ?s2 a gn:Location .
    ?s2 gn:asciiname ?value1 .
    ?s2 gn:country ?value2 .
  }
}

Returns all instances of gr:LocationOfSalesOrServiceProvisioning whose vcard:ADR node has a city-country combination without a matching pair among the instances of gn:Location.
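The functional dependency check can be pictured as a lookup of whole (city, country) pairs in the trusted reference. The Python sketch below uses hypothetical sample data and a hypothetical function name. It illustrates the logic only, not how a SPARQL engine evaluates the query.

```python
# Hypothetical tested data: location id -> (city, country)
locations = {
    "loc1": ("Las Vegas", "USA"),
    "loc2": ("Las Vegas", "France"),
    "loc3": ("Lisbon", "Portugal"),
}

# Hypothetical trusted reference of known (city, country) pairs
trusted_pairs = {
    ("Las Vegas", "USA"),
    ("Lisbon", "Portugal"),
    ("Paris", "France"),
}


def dependency_violations(tested, reference):
    """Return ids whose (city, country) pair is absent from the reference.

    Comparing whole pairs, rather than looking up one country per city,
    keeps homonymous city names in different countries valid.
    """
    return sorted(s for s, pair in tested.items() if pair not in reference)


print(dependency_violations(locations, trusted_pairs))
```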

Page 19: Using Semantic Web Resources for Data Quality Management

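Acquisition of Semantic Web Sources for DQM

(1) Replication of relevant knowledge-bases
(2) On the fly via federated SPARQL queries:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The default prefix (:) of the local data is not shown on the slide.
SELECT *
WHERE {
  ?s1 :location_CITY ?city .
  # The join on ?city only succeeds if the local values are literals
  # tagged "en", exactly like the DBpedia labels.
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> {
      ?s2 a dbo:City .
      ?s2 rdfs:label ?city .
      FILTER (lang(?city) = "en") .
    }
  }
  # Keep only local cities without a match in DBpedia
  FILTER(!bound(?s2))
}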
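Where the local query engine does not support SERVICE, a comparable on-the-fly lookup could be issued by a client, one value at a time. The sketch below builds, but does not send, a GET request URL that passes the query in the `query` parameter defined by the SPARQL protocol. The function name and the choice of an ASK query are assumptions for illustration, not part of the presented approach.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide


def city_lookup_url(city: str, endpoint: str = ENDPOINT) -> str:
    """Build a SPARQL protocol GET URL asking whether a dbo:City has this label."""
    # Escape backslashes and double quotes so the value stays one string literal
    literal = city.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f'ASK {{ ?s a dbo:City ; rdfs:label "{literal}"@en }}'
    )
    # urlencode percent-encodes the query text for use in the URL
    return endpoint + "?" + urlencode({"query": query})


url = city_lookup_url("Lisbon")
print(url)
```

Issuing one request per value trades a single federated query for many small ones, so the scalability concern for large data sets applies here as well.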


Page 20: Using Semantic Web Resources for Data Quality Management

Limitations

• High degree of uncertainty about the quality of Semantic Web resources
• Risk of data quality problem proliferation
• Lack of Semantic Web resources for certain domains
• Flexible design of RDF and structural heterogeneity complicate the definition of generic DQ constraints
• Scalability on large data sets
• DQ constraints close the world (missing data is treated as a violation, contrary to the open-world assumption of RDF)

Page 21: Using Semantic Web Resources for Data Quality Management

Contributions

• Data quality control for Semantic Web data
• Identification of potential inconsistencies between Semantic Web Resources
• Reduction of effort for the definition of functional dependency rules and legal value rules
• Reuse of shared data quality rules on a Web scale

Page 22: Using Semantic Web Resources for Data Quality Management

Future Work

• Semantic Web information quality assessment framework (SWIQA) with computation of KPIs
• Analysis and identification of useful "trusted references" based on SWIQA
• Application on multi-source master data of information systems
• Evaluation on large data sets

Page 23: Using Semantic Web Resources for Data Quality Management

Christian Fürber, Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany

skype c.fuerber
email [email protected]
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber

Data Quality Constraints Library for SPIN @ http://semwebquality.org/ontologies/dq-constraints#

Paper available at http://bit.ly/c5v6TM