32
Information Systems 29 (2004) 551–582 The DaQuinCIS architecture: a platform for exchanging and improving data quality in cooperative information systems $ Monica Scannapieco*, Antonino Virgillito, Carlo Marchetti, Massimo Mecella, Roberto Baldoni Dipartimento di Informatica e Sistemistica, Universit " a di Roma ‘‘La Sapienza’’, Via Salaria 113, Roma 00198, Italy Received 30 May 2003; received in revised form 1 November 2003; accepted 15 December 2003 Abstract In cooperative information systems, the quality of data exchanged and provided by different data sources is extremely important. A lack of attention to data quality can imply data of low quality to spread all over the cooperative system. At the same time, improvement can be based on comparing data, correcting them and thus disseminating high quality data. In this paper, we present an architecture for managing data quality in cooperative information systems, by focusing on two specific modules, the Data Quality Broker and the Quality Notification Service. The Data Quality Broker allows for querying and improving data quality values. The Quality Notification Service is specifically targeted to the dissemination of changes on data quality values. r 2003 Published by Elsevier Ltd. Keywords: Cooperative information system; Data quality; XML model; Data integration; Publish & subscribe 1. Introduction A Cooperative Information System (CIS) is a large scale information system that interconnects various systems of different and autonomous organizations, geographically distributed and sharing common objectives [1]. Among the differ- ent resources that are shared by organizations, data are fundamental; in real world scenarios, an organization A may not request data from an organization B if it does not ‘‘trust’’ B data, i.e., if A does not know that the quality of the data that B can provide is high. As an example, in an e-Government scenario in which public adminis- trations cooperate in order to fulfill service requests from citizens and enterprises [2], admin- istrations very often prefer asking citizens for data, rather than other administrations that have stored the same data, because the quality of such data is not known. Therefore, lack of cooperation may occur due to lack of quality certification. ARTICLE IN PRESS $ This work is supported by the Italian Ministry for Education, University and Research (MIUR), through the COFIN 2001 Project ‘‘DaQuinCIS—Methodologies and Tools for Data Quality inside Cooperative Information Systems’’ (http://www.dis.uniroma1.it/~dq/). The DaQuinCIS pro- ject is a joint research project carried out by DIS—Universit " a di Roma ‘‘La Sapienza’’, DISCo—Universit " a di Milano ‘‘Bicoc- ca’’ and DEI—Politecnico di Milano. *Corresponding author. Tel.: +39-06499-18479; fax: +39- 06853-00849. E-mail addresses: [email protected] (M. Scannapieco), [email protected] (A. Virgillito), [email protected] (C. Marchetti), mecella@dis. uniroma1.it (M. Mecella), [email protected] (R. Baldoni). 0306-4379/$ - see front matter r 2003 Published by Elsevier Ltd. doi:10.1016/j.is.2003.12.004

The DaQuinCIS architecture: a platform for exchanging and improving data quality in cooperative information systems

Embed Size (px)

Citation preview

ARTICLE IN PRESS

Information Systems 29 (2004) 551ndash582

$This work

Education Un

COFIN 2001 P

for Data Qual

(httpwwwd

ject is a joint res

Roma lsquolsquoLa Sap

carsquorsquo and DEImdash

Correspond

06853-00849

E-mail addr

(M Scannapiec

marchetdisu

uniroma1it (M

0306-4379$ - se

doi101016jis

The DaQuinCIS architecture a platform for exchanging andimproving data quality in cooperative information systems$

Monica Scannapieco Antonino Virgillito Carlo Marchetti Massimo MecellaRoberto Baldoni

Dipartimento di Informatica e Sistemistica Universit a di Roma lsquolsquoLa Sapienzarsquorsquo Via Salaria 113 Roma 00198 Italy

Received 30 May 2003 received in revised form 1 November 2003 accepted 15 December 2003

Abstract

In cooperative information systems the quality of data exchanged and provided by different data sources is

extremely important A lack of attention to data quality can imply data of low quality to spread all over the cooperative

system At the same time improvement can be based on comparing data correcting them and thus disseminating high

quality data In this paper we present an architecture for managing data quality in cooperative information systems by

focusing on two specific modules the Data Quality Broker and the Quality Notification Service The Data Quality

Broker allows for querying and improving data quality values The Quality Notification Service is specifically targeted

to the dissemination of changes on data quality values

r 2003 Published by Elsevier Ltd

Keywords Cooperative information system Data quality XML model Data integration Publish amp subscribe

1 Introduction

A Cooperative Information System (CIS) is alarge scale information system that interconnects

is supported by the Italian Ministry for

iversity and Research (MIUR) through the

roject lsquolsquoDaQuinCISmdashMethodologies and Tools

ity inside Cooperative Information Systemsrsquorsquo

isuniroma1it~dq) The DaQuinCIS pro-

earch project carried out by DISmdashUniversita di

ienzarsquorsquo DISComdashUniversita di Milano lsquolsquoBicoc-

Politecnico di Milano

ing author Tel +39-06499-18479 fax +39-

esses monscandisuniroma1it

o) virgidisuniroma1it (A Virgillito)

niroma1it (C Marchetti) mecelladis

Mecella) baldonidisuniroma1it (R Baldoni)

e front matter r 2003 Published by Elsevier Ltd

200312004

various systems of different and autonomousorganizations geographically distributed andsharing common objectives [1] Among the differ-ent resources that are shared by organizationsdata are fundamental in real world scenarios anorganization A may not request data from anorganization B if it does not lsquolsquotrustrsquorsquo B data ie ifA does not know that the quality of the data thatB can provide is high As an example in ane-Government scenario in which public adminis-trations cooperate in order to fulfill servicerequests from citizens and enterprises [2] admin-istrations very often prefer asking citizens for datarather than other administrations that have storedthe same data because the quality of such data isnot known Therefore lack of cooperation mayoccur due to lack of quality certification

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582552

Uncertified quality can also cause a deteriora-tion of the data quality inside single organizationsIf organizations exchange data without knowingtheir actual quality it may happen that data of lowquality spread all over the CIS On the other handCISrsquos are characterized by high data replicationie different copies of the same data are stored bydifferent organizations From a data qualityperspective this is a great opportunity improve-ment actions can be carried out on the basis ofcomparisons among different copies in ordereither to select the most appropriate one or toreconcile available copies thus producing a newimproved copy to be notified to all interestedorganizations

In this paper we propose an architecture formanaging data quality in CISrsquos The architectureaims to avoid dissemination of low qualified datathrough the CIS by providing a support for dataquality diffusion and improvement To the best ofour knowledge this is a novel contribution in theinformation quality area that aims at integratingin the specific context of CISrsquos both modeling andarchitectural issues

The TDQM CIS methodological cycle [3] de-fines the methodological framework in which the

Fig 1 The phases of the TDQM CIS cycle grey p

proposed architecture fits The TDQM CIS cyclederives from the extension of the TDQM cycle [4]to the context of CISrsquos it consists of the fivephases of Definition Measurement ExchangeAnalysis and Improvement (see Fig 1)

The Definition phase implies the definition of amodel for the data exported by each cooperatingorganization and for the quality data associated tothem Quality is a multi-dimensional concept data

quality dimensions characterize properties that areinherent to data [56] Both data and quality datacan be exported as XML documents as furtherdetailed in Section 3 The Measurement phaseconsists of the evaluation of the data qualitydimensions for the exported data The Exchange

phase implies the exact definition of the exchangedinformation consisting of data and of appropriatequality data corresponding to values of dataquality dimensions The Analysis phase regardsthe interpretation of the quality values containedin the exchanged information by the organizationthat receives it Finally the Improvement phaseconsists of all the actions that enable improve-ments of cooperative data by exploiting theopportunities offered by the cooperative environ-ment An example of improvement action is based

hases are specifically considered in this work

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 553

on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality

The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored

The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture

1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality

inside Cooperative Information Systemsrsquorsquo (httpwwwdis

uniroma1it~dq) is a joint research project carried out by

DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita

di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano

2 A cooperative framework for data quality

In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system

As an example a supply chain in which someenterprises offer basic products and some others

assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]

A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered

A cooperative information system is formed by a

set of organizations fOrg1yOrgng that cooperate

through a communication infrastructure N which

provide software services to organizations as wells

as reliable connectivity Each organization Orgi is

connected to N through a gateway Gi on which

software services offered by Orgi to other organiza-

tions are deployed A user is a software or human

entity residing within an organization and using the

cooperative systemSeveral CISrsquos are characterized by a high degree

of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones

Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations

In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows

ARTICLE IN PRESS

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

Orgn

CooperativeGateway

Orgn

RatingService

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

OrgnOrgn

RatingService

CooperativeGateway

Fig 2 The DaQuinCIS architecture

M Scannapieco et al Information Systems 29 (2004) 551ndash582554

the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2

Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q

model is described in Section 3In order to produce data and quality data

according to the D2Q model each organizationdeploys on its cooperative gateway a Quality

Factory service that is responsible for evaluatingthe quality of its own data The design of the

Quality Factory is out of the scope of this paperand has been addressed in [7]

The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering

function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function

The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4

The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the

ARTICLE IN PRESS

-BusinessName-Code

Business

-FiscalCode-Address-Name

Owner

1

(a)

-FiscalCode-Address-Name

Owner

-EnterpriseName-Code

Enterprise

1

(b)

-EntName-Code-LegalAddress

Enterprise

-FiscalCode-Address-Name

Holder

1

(c)

Fig 3 Local schemas used in the running example conceptual

representation (a) Local schema of INAIL (b) local schema of

INPS (c) Local schema of CoC

M Scannapieco et al Information Systems 29 (2004) 551ndash582 555

Quality Notification Service is deployed as a peer-to-peer system

The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]

21 Running example

We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues

In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information

2Examples of business events are the starting of a new

businesscompany or the closing down variations in legal

status board composition and senior management variations

in the number of employees as well as opening a new location

filling for a patent etc

that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation

Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality

In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown

3 The D2Q model

According to the DaQuinCIS architecture allcooperating organizations export their application

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582552

Uncertified quality can also cause a deteriora-tion of the data quality inside single organizationsIf organizations exchange data without knowingtheir actual quality it may happen that data of lowquality spread all over the CIS On the other handCISrsquos are characterized by high data replicationie different copies of the same data are stored bydifferent organizations From a data qualityperspective this is a great opportunity improve-ment actions can be carried out on the basis ofcomparisons among different copies in ordereither to select the most appropriate one or toreconcile available copies thus producing a newimproved copy to be notified to all interestedorganizations

In this paper we propose an architecture formanaging data quality in CISrsquos The architectureaims to avoid dissemination of low qualified datathrough the CIS by providing a support for dataquality diffusion and improvement To the best ofour knowledge this is a novel contribution in theinformation quality area that aims at integratingin the specific context of CISrsquos both modeling andarchitectural issues

The TDQM CIS methodological cycle [3] de-fines the methodological framework in which the

Fig 1 The phases of the TDQM CIS cycle grey p

proposed architecture fits The TDQM CIS cyclederives from the extension of the TDQM cycle [4]to the context of CISrsquos it consists of the fivephases of Definition Measurement ExchangeAnalysis and Improvement (see Fig 1)

The Definition phase implies the definition of amodel for the data exported by each cooperatingorganization and for the quality data associated tothem Quality is a multi-dimensional concept data

quality dimensions characterize properties that areinherent to data [56] Both data and quality datacan be exported as XML documents as furtherdetailed in Section 3 The Measurement phaseconsists of the evaluation of the data qualitydimensions for the exported data The Exchange

phase implies the exact definition of the exchangedinformation consisting of data and of appropriatequality data corresponding to values of dataquality dimensions The Analysis phase regardsthe interpretation of the quality values containedin the exchanged information by the organizationthat receives it Finally the Improvement phaseconsists of all the actions that enable improve-ments of cooperative data by exploiting theopportunities offered by the cooperative environ-ment An example of improvement action is based

hases are specifically considered in this work

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 553

on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality

The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored

The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture

1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality

inside Cooperative Information Systemsrsquorsquo (httpwwwdis

uniroma1it~dq) is a joint research project carried out by

DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita

di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano

2 A cooperative framework for data quality

In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system

As an example a supply chain in which someenterprises offer basic products and some others

assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]

A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered

A cooperative information system is formed by a

set of organizations fOrg1yOrgng that cooperate

through a communication infrastructure N which

provide software services to organizations as wells

as reliable connectivity Each organization Orgi is

connected to N through a gateway Gi on which

software services offered by Orgi to other organiza-

tions are deployed A user is a software or human

entity residing within an organization and using the

cooperative systemSeveral CISrsquos are characterized by a high degree

of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones

Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations

In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows

ARTICLE IN PRESS

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

Orgn

CooperativeGateway

Orgn

RatingService

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

OrgnOrgn

RatingService

CooperativeGateway

Fig 2 The DaQuinCIS architecture

M Scannapieco et al Information Systems 29 (2004) 551ndash582554

the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2

Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q

model is described in Section 3In order to produce data and quality data

according to the D2Q model each organizationdeploys on its cooperative gateway a Quality

Factory service that is responsible for evaluatingthe quality of its own data The design of the

Quality Factory is out of the scope of this paperand has been addressed in [7]

The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering

function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function

The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4

The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the

ARTICLE IN PRESS

-BusinessName-Code

Business

-FiscalCode-Address-Name

Owner

1

(a)

-FiscalCode-Address-Name

Owner

-EnterpriseName-Code

Enterprise

1

(b)

-EntName-Code-LegalAddress

Enterprise

-FiscalCode-Address-Name

Holder

1

(c)

Fig 3 Local schemas used in the running example conceptual

representation (a) Local schema of INAIL (b) local schema of

INPS (c) Local schema of CoC

M Scannapieco et al Information Systems 29 (2004) 551ndash582 555

Quality Notification Service is deployed as a peer-to-peer system

The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]

21 Running example

We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues

In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information

2Examples of business events are the starting of a new

businesscompany or the closing down variations in legal

status board composition and senior management variations

in the number of employees as well as opening a new location

filling for a patent etc

that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation

Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality

In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown

3 The D2Q model

According to the DaQuinCIS architecture allcooperating organizations export their application

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 553

on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality

The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored

The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture

1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality

inside Cooperative Information Systemsrsquorsquo (httpwwwdis

uniroma1it~dq) is a joint research project carried out by

DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita

di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano

2 A cooperative framework for data quality

In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system

As an example a supply chain in which someenterprises offer basic products and some others

assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]

A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered

A cooperative information system is formed by a

set of organizations fOrg1yOrgng that cooperate

through a communication infrastructure N which

provide software services to organizations as wells

as reliable connectivity Each organization Orgi is

connected to N through a gateway Gi on which

software services offered by Orgi to other organiza-

tions are deployed A user is a software or human

entity residing within an organization and using the

cooperative systemSeveral CISrsquos are characterized by a high degree

of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones

Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations

In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows

ARTICLE IN PRESS

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

Orgn

CooperativeGateway

Orgn

RatingService

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

OrgnOrgn

RatingService

CooperativeGateway

Fig 2 The DaQuinCIS architecture

M Scannapieco et al Information Systems 29 (2004) 551ndash582554

the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2

Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q

model is described in Section 3In order to produce data and quality data

according to the D2Q model each organizationdeploys on its cooperative gateway a Quality

Factory service that is responsible for evaluatingthe quality of its own data The design of the

Quality Factory is out of the scope of this paperand has been addressed in [7]

The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering

function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function

The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4

The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the

ARTICLE IN PRESS

-BusinessName-Code

Business

-FiscalCode-Address-Name

Owner

1

(a)

-FiscalCode-Address-Name

Owner

-EnterpriseName-Code

Enterprise

1

(b)

-EntName-Code-LegalAddress

Enterprise

-FiscalCode-Address-Name

Holder

1

(c)

Fig 3 Local schemas used in the running example conceptual

representation (a) Local schema of INAIL (b) local schema of

INPS (c) Local schema of CoC

M Scannapieco et al Information Systems 29 (2004) 551ndash582 555

Quality Notification Service is deployed as a peer-to-peer system

The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]

21 Running example

We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues

In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information

2Examples of business events are the starting of a new

businesscompany or the closing down variations in legal

status board composition and senior management variations

in the number of employees as well as opening a new location

filling for a patent etc

that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation

Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality

In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown

3 The D2Q model

According to the DaQuinCIS architecture allcooperating organizations export their application

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

Orgn

CooperativeGateway

Orgn

RatingService

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Quality NotificationService (QNS)

Data QualityBroker (DQB)

QualityFactory (QF)

back-end systems

internals of theorganization

Org1

CooperativeGateway

Org2

CooperativeGateway

Org2

CommunicationInfrastructure

CooperativeGateway

OrgnOrgn

RatingService

CooperativeGateway

Fig 2 The DaQuinCIS architecture

M Scannapieco et al Information Systems 29 (2004) 551ndash582554

the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2

Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q

model is described in Section 3In order to produce data and quality data

according to the D2Q model each organizationdeploys on its cooperative gateway a Quality

Factory service that is responsible for evaluatingthe quality of its own data The design of the

Quality Factory is out of the scope of this paperand has been addressed in [7]

The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering

function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function

The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4

The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the

ARTICLE IN PRESS

-BusinessName-Code

Business

-FiscalCode-Address-Name

Owner

1

(a)

-FiscalCode-Address-Name

Owner

-EnterpriseName-Code

Enterprise

1

(b)

-EntName-Code-LegalAddress

Enterprise

-FiscalCode-Address-Name

Holder

1

(c)

Fig 3 Local schemas used in the running example conceptual

representation (a) Local schema of INAIL (b) local schema of

INPS (c) Local schema of CoC

M Scannapieco et al Information Systems 29 (2004) 551ndash582 555

Quality Notification Service is deployed as a peer-to-peer system

The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]

21 Running example

We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues

In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information

2Examples of business events are the starting of a new

businesscompany or the closing down variations in legal

status board composition and senior management variations

in the number of employees as well as opening a new location

filling for a patent etc

that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation

Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality

In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown

3 The D2Q model

According to the DaQuinCIS architecture allcooperating organizations export their application

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

-BusinessName-Code

Business

-FiscalCode-Address-Name

Owner

1

(a)

-FiscalCode-Address-Name

Owner

-EnterpriseName-Code

Enterprise

1

(b)

-EntName-Code-LegalAddress

Enterprise

-FiscalCode-Address-Name

Holder

1

(c)

Fig 3 Local schemas used in the running example conceptual

representation (a) Local schema of INAIL (b) local schema of

INPS (c) Local schema of CoC

M Scannapieco et al Information Systems 29 (2004) 551ndash582 555

Quality Notification Service is deployed as a peer-to-peer system

The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]

21 Running example

We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues

In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information

2Examples of business events are the starting of a new

businesscompany or the closing down variations in legal

status board composition and senior management variations

in the number of employees as well as opening a new location

filling for a patent etc

that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation

Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality

In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown

3 The D2Q model

According to the DaQuinCIS architecture allcooperating organizations export their application

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582556

data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35

31 Data quality dimensions

Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]

In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]

In the following the general concept of schema

element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency

311 Accuracy

In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically

Accuracy is the distance between v and v0 being v0

the value considered as correctLet us consider the following example Citizen

is a schema element with an attribute Name and p

is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names

Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another

Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined

312 Completeness

Completeness is the degree to which values of a

schema element are present in the schema element

instanceAccording to such a definition we can consider

(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present

A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others

In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values

As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

3Basic types are the ones provided by the most common

programming languages and SQL that is Integer Real

Boolean String Date Time Interval Currency Any

M Scannapieco et al Information Systems 29 (2004) 551ndash582 557

citizen has an e-mail address which has not beenstored this is a case of low completeness

313 Currency

The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-

Birth can be considered invariant Currency canbe defined as

Currency is the distance between the instant when

a value changes in the real world and the instant

when the value itself is modified in the information

systemMore complex definitions of currency have been

proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work

Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]

314 Consistency

Consistency implies that two or more values donot conflict each other

A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as

Consistency is the degree to which the values of

the attributes of an instance of a schema element

satisfy the specific set of semantic rules defined on

the schema elementAs an example if we consider Citizen with

attributes Name DateOfBirth Sex and DateOf-

Death some possible semantic rules to be checkedare

the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency

the value of pDateOfBirth must precede thevalue of pDateOfDeath

Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])

32 The D2Q data model

The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q

quality schemas corresponding to the qualityportion of the D2Q model

Definition 31 (Data class) A data class

d ethnamedp1y pnTHORN consists of

a name named a set of properties pi frac14 namei typeiS i frac14

1y n nX1 where namei is the name of theproperty pi and typei can be

3

either a basic type3

3

or a data class

3

or a type set-of XS where XS can beeither a basic type or a data class

Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features

a set of nodes ND each of which is a data class a set of leaves T D that are all basic type

properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14

LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

EnterpriseEnterpriseClass

OwnerOwnerClass

Namestring Codestring

1 1

M Scannapieco et al Information Systems 29 (2004) 551ndash582558

a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1

4

FiscalCodestring

Addressstring Namestring

11

1

Fig 4 A D2Q data schema of the running example

A D2Q data schema must have the twofollowing properties

Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD

Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local

schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties

33 The D2Q quality model

This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class

and quality schemaBesides basic data types we define a set of types

that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as

-

4W

for m

01 1x and

1minim

taccuracy tcompleteness tconsistency tcurrency

the quality types considered in our modelMore specifically each quality type is defined

over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate

e assume that it is possible to specify the following values

inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where

y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the

um cardinality is 1

that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg

A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi

isthe quality class associated to pi In the followingwe will first define ldpi

where pi is a basic typeproperty then we define the quality class ld

Definition 33 (ldpi-Quality class associated to a

(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi

be a basic type property of dThe quality class ldpi

is defined by

a name nameldpi

four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of

a name nameld a set of properties ethkaccuracykcompleteness

kconsistencykcurrencypl1y plnTHORN such that

3

kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 559

3

ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as

ndash

if pi is a basic type property then typeli is

ldpi where ldpi

is the quality class asso-ciated to d pi

ndash

if pi is a set of XS type where X is abasic type then typeli is set of ldpi

Swhere ldpi

is the quality class associated tod pi

ndash

if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0

ndash

if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0

Definition 35 (D2Q Quality Schema) A D2Q

quality schema SQ is a direct node- and edge-

labelled graph with the following features

a set of nodes NQ each of which is a qualityclass

a set of leaves T Q that are quality typeproperties

a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q

where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels

Enterprise_Quality

Owner_QualityOwnerQualityClass

FiscalCode_QualityFiscalCodeQualityClass

Address_QualityAddressQualityClass

QualityN

Owner_Accuracyτ

hellip

hellip hellip

FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy

EnterprAccur

1 1 1

accuracy

Fig 5 A D2Q quality schema

consisting of quality property namequality typeS

a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows

3

Ente

Namame

a

hellipise_acy

of

if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality

3

if n2 is a quality class then l12 frac14number

of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1

Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality

are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the

rpriseQualityClass

e_QualityClass

Name_QualityName_QualityClass

Name_ccuracy

Code_QualityCodeQualityClass

hellip

hellip

hellip

Code_Accuracyτ

Name_Accuracyτ

1

1

accuracy

accuracy

the running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582560

domains of data quality dimension types to such ascope

34 The D2Q model data + quality

In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q

data schema and the D2Q quality schemaSpecifically we define the notion of quality

association between nodes of the two graphs

Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q

data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ

A quality association is a functionqualityAss A-B such that

(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function

It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions

Enterprise

Owner

NameCode

Owner_Quality

quality

qualityhelliphellip

quality

quality

1 1

Fig 6 A D2Q schema of t

Definition 37 (D2Q Schema) Given the D2Q

data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures

a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ

a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where

EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ

a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ

where LEDQ frac14 fqualityg is a set of quality

labels

The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)

341 D2Q schema instances

The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q

model

Enterprise_Quality

Name_Quality

Enterprise_accuracy

Name_accuracy

Code_Quality

Code_accuracy

1

1

1

hellip

hellip

hellip

he running example

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 561

More specifically

data classes instances are data objects similarly quality classes instances are quality

objects quality association values corresponds to qual-

ity links

A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type

property name basic type property valueSEdges are not labelled at all

Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type

property name quality type property valueSEdges are not labelled at all

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

[APTA]

Code[AssociazionePromozione hellip]

Owner1

Owner2

EntNameFiscalCode

Name

[MCLMSM73H17H501F] [Via dei Gracchi 71

00192 Roma Italy]

[Massimo Mecella]

Address

FiscalCode

[SCNMNC74S60G230T]

[S

Enterprise1

Code_Qua

EntN

FiscalC

Enterp

Code_Accuracy[high] EntNa

[high

quality

Fig 7 D2Q schema instance

Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label

Fig 7 shows the graph representation of a D2Q

schema instance of the running example TheEnterprise1 object instance of the Enterprise

class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)

35 Implementation of the D2Q model

In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

Name

[Via Mazzini 27 84016 Pagani (SA) Italy]Monica

cannapieco]

Address

lity

ame_Quality

ode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

rise_Quality

me_Accuracy]

Owner1_Quality

Address_Accuracy[high]

Address_Currency[medium]

Name_Accuracy[high]

Owner2_Quality

FiscalCode_QualityName

FiscalCode_Accuracy[high]

Address_Quality

Address_Accuracy[medium]

Name_Accuracy[medium]

quality

quality

of the running example

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Fig 8 An example of definition of a quality type as an XML

Schema type

M Scannapieco et al Information Systems 29 (2004) 551ndash582562

The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q

schema instances as XML documents is alsodescribed in the following of this section

351 From a D2Q schema to an XML schema

D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]

Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them

Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate

The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be

identified by explicitly introducing Object Identi-fiers (OIDs)

Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects

The principal choices for the representation ofD2Q schemas as XML Schemas were

representing data and quality classes and alsotheir properties as XML elements

limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association

definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model

definition of a root element for each D2Q dataschema

definition of a root element for each D2Q

quality schema

An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types

The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion

352 From a D2Q schema instance to an XML file

The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Fig 9 XML Schema of the D2Q data schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 563

is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules

a root element is defined to contain all dataobjects

a root element is defined to contain all qualityobjects

an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality

a property of a data class is mapped into anXML element

similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it

a property of a quality class is also mapped intoan XML element

We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9

4 The Data Quality Broker

The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation

41 Overview

The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q

global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Fig 10 XML Schema of the D2Q quality schema of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582564

Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway

in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried

A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 565

mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)

The Data Quality Broker performs two specifictasks

query processing and quality improvement

The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43

42 Query processing

The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q

model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved

Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582566

and we select the copy to return as a result on thebasis of such values

Specifically the detailed steps we follow toanswer a query with quality constraints are

(1)

Table

Map

Glob

Enter

Enter

Enter

Hold

Hold

y

y

Let Q be a query posed on the global schemaG

(2)

The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources

(3)

-Name-ID-LegalAddress

Enterprise

The execution of the queries Q1yQn

returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]

(4)

-FiscalCode-Address-Name

Holder

1

Fig 12 Graphical representation of the global schema

The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values

1

ping between global and local schemas

al schema INPS schema

prise Enterprise

priseID EnterpriseCode

priseLegalAddress X

er Owner

erName OwnerName

y

y

421 Example of query processing

In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1

Let us consider the following XQuery query tobe posed on the global schema

for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)

Enterprise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-

INAIL schema CoC schema

Business Enterprise

BusinessCode EnterpriseCode

X EnterpriseLegalAddress

Owner Holder

OwnerName HolderName

y y

y y

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Accuracy

item

2

Acc

Name=

medium

Acc

Address

frac14medium

Acc

Namefrac14

high

Italyrsquorsquo

Acc

Address

frac14high

Acc

Namefrac14

low

Acc

Address

frac14low

M Scannapieco et al Information Systems 29 (2004) 551ndash582 567

sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL

iecorsquorsquo

84016Paganirsquorsquo

piecorsquorsquo

84016Pagani(SA)

Italyrsquorsquo

for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-

prise

where $bCode =lsquolsquoAPTArsquorsquo

returnHolderSf$b=Namegf$b=Addressg=HolderS

Table

2

Queryresultsin

asample

scenario

Organization

Holderitem

1Accuracy

item

1Holderitem

2

INAIL

lsquolsquoMaim

oMecellarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanap

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

INPS

lsquolsquoMMecallarsquorsquo

Acc

Namefrac14

low

lsquolsquoMonicaScanna

lsquolsquoVia

dei

Gracchi71RomaItalyrsquorsquo

Acc

Address

frac14medium

lsquolsquoVia

Mazzini27

CoC

lsquolsquoMassim

oMecellarsquorsquo

Acc

Namefrac14

high

lsquolsquoMScannarsquorsquo

lsquolsquoVia

dei

Gracchi7100192RomaItalyrsquorsquo

Acc

Address

frac14high

lsquolsquoVia

Mazzini27

Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high

Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data

Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582568

instance level reconciliation is performed on thebasis of quality values

According to the shown accuracy values theresult is composed as follows

HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo

is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-

c Name(CoC)) frac14 Acc Name(CoC) frac14 highand

HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo

is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high

Similar considerations lead to the selection ofthe Holder item 2 provided by INPS

43 Internal architecture and deployment

In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown

431 Interaction modes and application interface

The Data Quality Broker exports the followinginterface to its clients

invoke(query mode) for submitting a queryQ over the global schema including quality

Broker Org1 Orgm

Invoke Q

Invoke Q

D1

Dm

Db=Compare D1hellipDm

Db

Invoke Q hellip

hellip

hellip

(a) (b

Fig 13 Broker invocation mode

constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED

valueIn the basic mode the broker accesses all

sources that provide data satisfying the queryand builds the result whereas in the optimized

mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one

Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj

that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required

propose(data) for notifying all sources thathave previously sent data with lower quality

Broker Orgj

Invoke Q

Dj

Dj

Invoke Q

)

s (a) Basic (b) optimised

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 569

than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback

432 Modules of the Data Quality Broker

The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison

Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE

can also perform some query optimizations duringthe processing

Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a

DBQ

invoke() propose()

QE

TE

Cmp

SM

WrDB

Fig 14 Broker modules

read-only access module that is it returns datastored inside organizations without modifyingthem

Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization

Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources

Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one

In the following we detail the behaviour of QE

and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]

The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization

The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types

Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582570

to block the execution of a query to enforceQoS constraints In this case the results arediscarded

Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications

The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE

433 Query processing in the basic mode

In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user

After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts

A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation

Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke

ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi

(1)

QE1 receives the query Q posed on the globalschema as a parameter of the invoke()

function

(2)

QE1 retrieves the mapping with local sources

from SM1

(3)

QE1 performs the unfolding that returns

queries for Org2 and Org3 that are passedto TE1

(4)

TE1 sends the request to Org2 and Org3

(5)

The recipient TErsquos pass the request to the

Wrrsquos modules Each Wr accesses the localdata store to retrieve data

(6)

Wr3 retrieves a query result eg Ra with

accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09

(7)

Each TE sends the result to TE1

(8)

TE1 sends the collected results (Ra Rb) to

QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)

(9)

QE1 sends the selected result to the client

(10)

QE1 starts the feedback process sending

the selected result (Rb with accuracy 09) toTE1

(11)

TE1 sends it to Org3

(12)

TE3 receives the feedback data and makes a

call propose(Rb with accuracy 09)

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5DQB

QE

TE

SM DB

DQB

TE WR

DB

DQB

TE WR

DB

invoke()1

2

3

4

4

5

5

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

DQB

TE

Cmp DBTE WR

DB

DQB

TE WR

DB

8

7

7

6

6

10

9

11 12

propose()

DQB

QE8

(a)

(b)

Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12

5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework

M Scannapieco et al Information Systems 29 (2004) 551ndash582 571

44 Implementation

We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web

Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6

As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582572

Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults

If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]

7Let us remark that each attribute has an associated set of

quality dimensions

5 The quality notification service

The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS

The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification

The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services

In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21

51 Specification

Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality

change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data

Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

subscriber

QNS

QualityFactory

∆Qev

publisher

sub∆Q

∆Qev

Fig 16 Quality Notification Service

8For the sake of simplicity we omit to present aspects related

to user unsubscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582 573

objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization

Therefore a generic Quality Notification Servicesubscription is a triple in the form

subDQ frac14 d frac12 S exprD exprQS

where

d is a data class expressed on the global schema S is an optional set of sources for the data class

d ie organizations containing data objectsbelonging to d according to the mapping rules

exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v

is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-

tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1

At least d must be specified while other para-meters can be left empty (ie set to gt)8

Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below

Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form

DQevfrac14 d d fyqd vSiygS

where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d

Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff

(1)

the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS

(2)

the data object dN satisfies exprD of S

(3)

the set fyqd vSiyg satisfies exprQ of S

Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]

We now introduce the concept of containment

between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1

contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1

exprQ1S

and S2 frac14 d2frac12 S2 exprD2 exprQ2

S then S1 con-

tains S2 iff

(1)

d1 is an ancestor node of d2 in the D2Q acyclic

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Table 3

Examples of subscription to Quality Notification Service

Subscription

S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S

9M

is con

numb

check

M Scannapieco et al Information Systems 29 (2004) 551ndash582574

graph and if both S1agt and S2agt thenS2DS1

(2)

for each data object d that satisfies exprD2 d

also satisfies exprD1

(3)

analogously for each set fyqd vSiyg thatsatisfies exprQ2

also exprQ1is satisfied by

fyqd vSiyg

It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1

contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9

Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example

Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name

attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code

ore precisely if the complexity of checking if a constraint

tained within another is constant being n the sum of the

er of constraints contained in two subscriptions then the

costs Oethn2THORN

is equal to a specific value Finally subscription S4

is the same as the previous one but involves onlydata objects stored in two specific sources

It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others

Concerning matching a valid quality changeevent generated by a Quality Factory might be

Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs

52 Overview of Quality Notification Service

Implementation

The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer

of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-

10 In the following for simplicity we refer to a Quality

Notification Service instance as lsquolsquoa Quality Notification

Servicersquorsquo whenever this is not confusing

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 575

tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source

The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context

Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization

Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic

Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic

In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-

tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability

Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users

To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge

subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi

and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4

in the examples of Table 3 is the subscription S2

itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi

The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications

Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

QNSINPS

S1S2S3

QNSINAIL

S3S4

QNS1 S1QNS2 S3

QNSCoC

QFe1e2

e1

e1

e2

U13

U12

U11

U21

U22

Fig 17 Exploiting Merge Subscriptions

M Scannapieco et al Information Systems 29 (2004) 551ndash582576

two following quality changes events generated inCoC for the dataclass class holder

e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14

lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14

lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS

e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14

lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14

lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS

e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer

Let us finally point out that also subscriptiontraffic is reduced due to the use of merge

subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1

subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1

Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic

Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event

Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

QNSp

QNS2

QNS5

QNS1

QNS4

QNS6

QNS3

QFe1

e1lt65gt

e1lt43gt

e1lt gte1lt gt

e1lt gt

e1lt gt

Fig 18 Diffusion tree example

M Scannapieco et al Information Systems 29 (2004) 551ndash582 577

mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree

53 Quality notification service internal

architecture

Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components

Internal Interface Layer (IIL) The IIL consumerside comprises the following components

Interpreter receives subscriptions from localusers interprets the subscription language and

stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization

Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users

Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users

The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)

On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization

External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-

Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively

The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External

Merge Table data structure maintains the

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

Quality Factory

LocalSubscription

Table

Interpreter Local Dispatcher

Local Matcher

ExternalMergeTableMerger

ForwarderMerge

SubscriptionTable

External Matcher

Tree Builder

Consumer Producer

notify()subscribe()unsubscribe()

Web Service

changeSub()changeSub()

Web Service

External Diffusion Layer

Internal Interface Layer

SchemaMapper

notify()notify() notify()

Fig 19 Quality Notification Service

11Let us remark that a merge subscription is in the general

case obtained from the union of a set of user subscriptions that

do not satisfy containment Therefore a merge subscription can

be represented as a set of user subscriptions which are stored in

the Merge Subscription Table

M Scannapieco et al Information Systems 29 (2004) 551ndash582578

merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC

Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18

The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components

Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the

Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper

Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method

Let us conclude this section by pointing out that inessence Quality Notification Service is a content-

based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 579

matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)

6 Related work

Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]

In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics

For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment

Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for

CISrsquos in which the requirements on data to beexchanged are very strict

When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]

The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level

conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582580

instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems

For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications

7 Conclusions and future work

Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-

zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations

Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation

The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues

An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified

We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582 581

algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement

Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]

Acknowledgements

This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni

Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions

Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality

References

[1] G De Michelis E Dubois M Jarke F Matthes

J Mylopoulos M Papazoglou K Pohl J Schmidt

C Woo E Yu Cooperative Information Systems A

Manifesto in M Papazoglou G Schlageter (Eds)

Cooperative Information Systems Trends amp Directions

Accademic Press New York 1997

[2] C Batini M Mecella Enabling Italian e-Government

through a cooperative architecture IEEE Computer 34 (2)

[3] P Bertolazzi M Scannapieco Introducing Data Quality

in a Cooperative Context in Proceedings of the 6th

International Conference on Information Quality

(ICIQrsquo01) Boston MA USA 2001

[4] R Wang A Product Perspective on Total Data Quality

Management Commun ACM 41 (2)

[5] Y Wand R Wang Anchoring data quality dimensions in

ontological foundations Commun ACM 39 (11)

[6] T Redman Data Quality for the Information Age Artech

House 1996

[7] C Cappiello C Francalanci B Pernici P Plebani

M Scannapieco Data quality assurance in cooperative

information systems a multi-dimension quality certificate

in Proceedings of the ICDTrsquo03 International Workshop

on Data Quality in Cooperative Information Systems

(DQCISrsquo03) Siena Italy 2003

[8] M Lenzerini Data integration a theoretical perspective

in Proceedings of the 21st ACM Symposium on Principles

of Database Systems (PODS 2002) Madison Wisconsin

USA 2002

[9] L De Santis M Scannapieco T Catarci Trusting data

quality in cooperative information systems in Proceedings

of the 11th International Conference on Cooperative

Information Systems (CoopISrsquo03) Catania Italy 2003

[10] P Aimetti C Batini C Gagliardi M Scannapieco The

lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data

Quality Improvement in the Italian Public Administration

in Experience Paper at the 7th International Conference on

Information Quality (ICIQrsquo02) Boston MA USA 2001

[11] P Hall G Dowling Approximate string matching ACM

Comput Surveys 12 (4)

[12] L Pipino Y Lee R Wang Data quality assessment

Commun ACM 45 (4)

[13] D Ballou R Wang H Pazer G Tayi Modeling

information manufacturing systems to determine informa-

tion product quality Manage Sci 44 (4)

[14] P Missier M Scannapieco C Batini Cooperative

Architectures Introducing Data Quality Technical Report

14-2001 Dipartimento di Informatica e Sistemistica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001

[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)

Temporal Databases Benjamin-Cummings Menlo Park

CA 1993

[16] E Codd Relational database a practical foundation for

productivity Commun ACM 25 (2)

[17] R Bruni A Sassano Errors detection and correction in

large scale data collecting in Proceedings of the Fourth

International Conference on Advances in Intelligent Data

Analysis Cascais Portugal 2001

[18] M Fernandez A Malhotra J Marsh M Nagy

N Walshand XQuery 10 and XPath 20 Data Model

W3C Working Draft available from http

wwww3orgTRquery-datamodel (November 2002)

[19] H Thompson D Beech M Maloney N Mendelsohn

XML Schema Part 1 Structures W3C Recommendation

available from httpwwww3orgTRxmlschema-1

(May 2001)

[20] P Biron A Malhotra XML Schema Part 2 Datatypes

W3C Recommendation available from http

wwww3orgTRxmlschema-2 (May 2001)

[21] S DeRose E Maler D Orchard XML Linking Language

(XLink) Version 10 W3C Recommendation available

from httpwwww3orgTRxlink (June 2001)

[22] S DeRose R Daniel P Grosso E Maler J Marsh

N Walsh XML Pointer Language (XPointer) W3C

Recommendation available from httpwwww3org

TRxptr (August 2002)

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313

ARTICLE IN PRESS

M Scannapieco et al Information Systems 29 (2004) 551ndash582582

[23] J Ullman Information Integration Using Logical Views

in Proceedings of the 6th International Conference on

Database Theory (ICDT rsquo97) Delphi Greece 1997

[24] S Boag D Chamberlin M Fernandez D Florescu

J Robie J Simeon XQuery 10 An XML Query

Language W3C Working Draft available from

httpwwww3orgTRxquery (November 2002)

[25] D Milano Tesi di Laurea in Ingegneria Informatica

Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria

(2003 (in Italian) (the thesis is available by writing an

e-mail to monscandisuniroma1it))

[26] C Hwang K Yoon Multiple Attribute Decision Making

in Lectures Notes in Economics and Mathematical

Systems Vol 186 Springer Berlin 1981

[27] T Saaty The Analytic Hierarchy Process McGraw-Hill

New York 1980

[28] A Buchmann F Casati L Fiege M Hsu M Shan

(Eds) Proceedings of the 3rd VLDB International Work-

shop on Technologies for e-Services (VLDB-TES 2002)

Hong Kong China 2002

[29] T Chandra S Toueg Unreliable failure detectors for

reliable distributed systems J ACM (JACM) 43 (2) (1996)

225ndash267

[30] M Fischer N Lynch M Paterson Impossibility of

distributed consensus with one faulty process J ACM

(JACM) 32 (2) (1985) 374ndash382

[31] MK Aguilera RE Strom DC Sturman M Astley

TD Chandra Matching events in a content-based

subscription system in Symposium on Principles of

Distributed Computing 1999 pp 53ndash61

[32] F Fabret HA Jacobsen F Llirbat J Pereira

KA Ross D Shasha Filtering algorithms and imple-

mentation for very fast publishsubscribe systems SIG-

MOD Record (ACM Special Interest Group on

Management of Data) 30 (2) (2001) 115ndash126

[33] F Naumann U Leser J Freytag Quality-driven integra-

tion of heterogenous information systems in Proceedings

of 25th International Conference on Very Large Data Bases

(VLDBrsquo99) Edinburgh Scotland UK 1999

[34] L Berti-Equille Quality-extended query processing for

distributed processing in Proceedings of the ICDTrsquo03

International Workshop on Data Quality in Cooperative

Information Systems (DQCISrsquo03) Siena Italy 2003

[35] I Fellegi A Sunter A theory for record linkage J Am

Stat Assoc 64

[36] IP Fellegi D Holt A systematic approach to automatic

edit and imputation J Am Stat Assoc 71 (1976) 17ndash35

[37] M Hernandez S Stolfo Real-world data is dirty data

cleansing and the mergepurge problem J Data Min

Knowledge Discovery 1 (2)

[38] H Galhardas D Florescu D Shasha E Simon An

extensible framework for data cleaning in Proceedings of

the 16th International Conference on Data Engineering

(ICDE 2000) San Diego CA USA 2000

[39] R Wang H Kon S Madnick Data quality requirements

analysis and modeling in Proceedings of the Ninth

International Conference on Data Engineering (ICDE

rsquo93) Vienna Austria 1993

[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)

Fundamentals of Data Warehouses Springer Berlin

1995

[41] G Mihaila L Raschid M Vidal Using quality of data

metadata for source selection and ranking in Proceedings

of the Third International Workshop on the Web and

Databases (WebDBrsquo00) Dallas Texas 2000

[42] L Yan M Ozsu Conflict tolerant queries in AURORA

in Proceedings of the Fourth International Conference on

Cooperative Information Systems (CoopISrsquo99) Edin-

burgh Scotland UK 1999

[43] K Sattler S Conrad G Saake Interactive example-

driven integration and reconciliation for accessing data-

base integration Inform Syst 28

[44] K Fan H Lu S Madnick D Cheung Discovering and

reconciling value conflicts for numerical data integration

Inform Syst 28

[45] S Bressan CH Goh K Fynn M Jakobisiak

K Hussein H Kon T Lee S Madnick T Pena J

Qu A Shum M Siegel The context interchange mediator

prototype in Proceedings of the ACM SIGMOD Inter-

national Conference on Management of Data Tucson

Arizona USA 1997

[46] P Eugster S Handurukande R Guerraoui A Kermar-

rec P Kouznetsov Lightweight probabilistic broadcast

in Proceedings of the International Conference on

Dependable Systems and Networks (DSN 2001) July

2001

[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret

K Ross D Shasha Publishsubscribe on the web at

extreme speed in Proceedings of the ACM SIGMOD

Conference on Management of Data Cairo Egypt 2000

[48] M Castro P Druschel A Kermarrec A Rowston

Scribe A large-scale and decentralized application-level

multicast infrastructure IEEE J Select Areas Commun

20 (8)

[49] A Campailla S Chaki EM Clarke S Jha H Veith

Efficient filtering in publish-subscribe systems using

binary decision diagrams in Proceedings of the Interna-

tional Conference on Software Engineering 2001

pp 443ndash452

[50] KP Birman The process group approach to reliable

distributed computing Commun ACM 12 (36) (1993)

36ndash53

[51] KP Birman A review of experiences with reliable

multicast Software-Pract Exper 9 (29) (1999) 741ndash774

[52] R Piantoni C Stancescu Implementing the Swiss

exchange trading system in Proceedings of the 27th

Annual International Symposium on Fault-Tolerant

Computing (FTCSrsquo97) June 1997 pp 309ndash313